Joint Models with Missing Data for Semi-Supervised Learning
Jason Eisner, NAACL Workshop Keynote, June 2009
*Eisner & Tromble - HLT-NAACL Workshop on Computationally Hard Methods and Joint Inference in NLP - June 2006
Outline
- Why use joint models?
- Making big joint models tractable: approximate inference and training by loopy belief propagation
- Open questions: semi-supervised training of joint models
The standard story
Task: a p(y|x) model. Semi-supervised learning: train on many (x,?) and a few (x,y).
Some running examples
- sentence → parse (with David A. Smith)
- lemma → morphological paradigm (with Markus Dreyer), e.g., in low-resource languages
Semi-supervised learning: train on many (x,?) and a few (x,y).
Semi-supervised learning
Why would knowing p(x) help you learn p(y|x)?
- Shared parameters via a joint model, e.g., noisy channel: p(x,y) = p(y) * p(x|y)
- Estimate p(x,y) to have the appropriate marginal p(x); this affects the conditional distribution p(y|x).
(Figure: a sample of points drawn from p(x).)
For any x, we can now recover the cluster c that probably generated it. A few supervised examples may then let us predict y from c. E.g., if p(x,y) = Σ_c p(x,y,c) = Σ_c p(c) p(y|c) p(x|c) (a joint model!). (Figure: the same sample of p(x), now clustered.)
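This cluster story can be made concrete with a tiny toy model. A minimal sketch, with all probability tables invented for illustration: once unlabeled data has (somehow) fixed p(c) and p(x|c), a few labeled examples pin down p(y|c), and the joint model then yields p(y|x) by Bayes' rule.

```python
# Toy joint model p(x,y) = sum_c p(c) p(y|c) p(x|c). All numbers invented.

p_c = {0: 0.5, 1: 0.5}                       # cluster prior
p_x_given_c = {0: {"a": 0.9, "b": 0.1},      # shaped by the unlabeled x's
               1: {"a": 0.2, "b": 0.8}}
p_y_given_c = {0: {"+": 0.95, "-": 0.05},    # pinned down by a few (x,y) pairs
               1: {"+": 0.1,  "-": 0.9}}

def p_xy(x, y):
    """Joint probability via the latent cluster c."""
    return sum(p_c[c] * p_y_given_c[c][y] * p_x_given_c[c][x] for c in p_c)

def p_y_given_x(y, x):
    """Conditional p(y|x), derived from the joint model."""
    z = sum(p_xy(x, yy) for yy in ("+", "-"))
    return p_xy(x, y) / z
```

Because x = "a" is mostly generated by cluster 0, and cluster 0 mostly emits y = "+", the model predicts "+" for "a" even though no labeled ("a", y) pair was needed at this point.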
Semi-supervised learning
Why would knowing p(x) help you learn p(y|x)?
- Shared parameters via a joint model, e.g., noisy channel: p(x,y) = p(y) * p(x|y)
- Estimate p(x,y) to have the appropriate marginal p(x); this affects the conditional distribution p(y|x).
- The picture is misleading: there is no need to assume a distance metric (as in TSVM, label propagation, etc.). But we do need to choose a model family for p(x,y).
NLP + ML = ???
Task: structured input x (may be only partly observed, so infer x, too) → structured output y (so we already need joint inference for decoding, e.g., dynamic programming). The p(y|x) model depends on features of (x,y) (sparse features?), or on features of (x,y,z) where z is latent (so infer z, too).
Each task in a vacuum?
Solved tasks help later ones? (e.g., a pipeline)
x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y
Feedback?
x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y
What if Task3 isn't solved yet and we have little training data?
Feedback?
x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y
What if Task3 isn't solved yet and we have little training data? Impute its output given x and y!
A later step benefits from many earlier ones?
x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y
A later step benefits from many earlier ones?
x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y
And conversely?
We end up with a Markov Random Field (MRF) over the variables x, z1, z2, z3, y, connected by factors.
Variable-centric, not task-centric
p(x,z1,z2,z3,y) = (1/Z) ψ1(x,z1) ψ2(z1,z2) ψ3(x,z1,z2,z3) ψ4(z3,y) ψ5(y)
First, a familiar MRF example: a Conditional Random Field (CRF) for POS tagging.
Observed input sentence (shaded): "find preferred tags". A possible tagging (i.e., an assignment to the remaining variables): v v v.
The same CRF, with another possible tagging of "find preferred tags": v a n.
Familiar MRF example: CRF
A binary factor measures the compatibility of two adjacent tags. The model reuses the same parameters at each position:

        v  n  a
    v   0  2  1
    n   2  1  0
    a   0  3  1
Familiar MRF example: CRF
A unary factor evaluates the tag of "tags". Its values depend on the corresponding word ("tags" can't be an adjective):

    v  0.2
    n  0.2
    a  0
Familiar MRF example: CRF
The unary factor's values depend on the corresponding word (and could be made to depend on the entire observed sentence):

    v  0.2
    n  0.2
    a  0
Familiar MRF example: CRF
A different unary factor appears at each position:

    find:       v 0.3   n 0.02   a 0
    preferred:  v 0.3   n 0      a 0.1
    tags:       v 0.2   n 0.2    a 0
Familiar MRF example: CRF
p(v a n) is proportional to the product of all the factors' values on the tagging v a n:
= 1 * 3 * 0.3 * 0.1 * 0.2
(binary v→a = 1; binary a→n = 3; unary find:v = 0.3; unary preferred:a = 0.1; unary tags:n = 0.2)
NOTE: This is not just a pipeline of single-tag prediction tasks (which might work OK in a well-trained supervised case).
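The arithmetic on this slide can be checked mechanically. A minimal sketch using the factor tables from these slides (the assignment of unary tables to words is inferred from the product 1 * 3 * 0.3 * 0.1 * 0.2 shown above):

```python
# Unnormalized CRF score for a tagging: the product of all unary and binary
# factor values. Tables copied from the slides.

BINARY = {("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
          ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
          ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1}
UNARY = {"find":      {"v": 0.3, "n": 0.02, "a": 0.0},
         "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
         "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0}}

def score(words, tags):
    """Product of factor values on one tagging (proportional to p(tags|words))."""
    s = 1.0
    for w, t in zip(words, tags):
        s *= UNARY[w][t]                 # unary factor at each position
    for t1, t2 in zip(tags, tags[1:]):
        s *= BINARY[(t1, t2)]            # binary factor on each adjacent pair
    return s

# score(..., ["v","a","n"]) = 0.3 * 0.1 * 0.2 * 1 * 3, i.e. about 0.018
```

The zero entries matter: any tagging that uses a forbidden transition (e.g., v followed by v) or a forbidden word/tag pair gets score 0 outright.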
Task-centered view of the world:
x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y
Variable-centered view of the world:
p(x,z1,z2,z3,y) = (1/Z) ψ1(x,z1) ψ2(z1,z2) ψ3(x,z1,z2,z3) ψ4(z3,y) ψ5(y)
Variable-centric, not task-centric
Throw in any variables that might help! Model and exploit correlations
Relations that could connect variables: entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings, typos, formatting, entanglement, annotation (over N tokens).
Back to our (simpler!) running examples
- sentence → parse (with David A. Smith)
- lemma → morphological paradigm (with Markus Dreyer)
Parser projection
sentence → parse: little direct training data, but much more training data is available for related variables.
Parser projection
German: Auf diese Frage habe ich leider keine Antwort bekommen
English: I did not unfortunately receive an answer to this question
Parser projection
sentence → parse: little direct training data, much more training data elsewhere; we need an interesting model.
Parses are not entirely isomorphic
(German/English example again.) Alignment configurations between the two parses: monotonic, null (NULL-aligned words), head-swapping, siblings.
Dependency relations between aligned parses, plus "none of the above".
Parser projection
Variables: sentence, parse, translation, word-to-word alignment, parse of translation.
Typical test data: only the sentence is observed (no translation).
Parser projection
Variables: sentence, parse, translation, word-to-word alignment, parse of translation.
A small supervised training set (treebank).
Parser projection
Variables: sentence, parse, translation, word-to-word alignment, parse of translation.
A moderate treebank in the other language.
Parser projection
Variables: sentence, parse, translation, word-to-word alignment, parse of translation.
Maybe a few gold alignments.
Parser projection
Variables: sentence, parse, translation, word-to-word alignment, parse of translation.
Lots of raw bitext.
Parser projection
Variables: sentence, parse, translation, word-to-word alignment, parse of translation.
Given bitext, try to impute the other variables. Now we have more constraints on the parse, which should help us train the parser. We'll see how belief propagation naturally handles this.
English does help us impute the Chinese parse
English: "In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement."
Chinese gloss: China / in / infrastructure / construction / area / , / has begun / to utilize / international / financial / organizations / 's / loans / to implement / international / competitive / bidding / procurement.
Seeing the noisy output of an English WSJ parser fixes Chinese links that would otherwise be wrong (the corresponding bad versions are found without seeing the English parse): a subject that attaches to an intervening noun, and complement verbs that swap objects.
…which does help us train a monolingual Chinese parser.
(Could add a 3rd language...)
Variables: sentence, parse, translation, parse of translation, alignment.
(Could add world knowledge...)
Variables: sentence, parse, translation, word-to-word alignment, parse of translation.
(Could add a bilingual dictionary...)
Variables: sentence, parse, translation, word-to-word alignment, parse of translation. Since the dictionary is incomplete, treat it as a partially observed variable.
Dynamic Markov Random Field
Variables: sentence, parse, translation, parse of translation, alignment. Note: these are structured variables.
Each is expanded into a collection of fine-grained variables (words, dependency links, alignment links, ...).
Thus, the number of fine-grained variables and factors varies by example (but all examples share a single finite parameter vector).
Back to our running examples: sentence → parse (with David A. Smith).
Morphological paradigm
An (as yet unfilled) paradigm grid for an infinitive xyz:

             Present   Past
    1st Sg
    2nd Sg
    3rd Sg
    1st Pl
    2nd Pl
    3rd Pl
Morphological paradigm
                 inf: werfen
             Present   Past
    1st Sg   werfe     warf
    2nd Sg   wirfst    warfst
    3rd Sg   wirft     warf
    1st Pl   werfen    warfen
    2nd Pl   werft     warft
    3rd Pl   werfen    warfen
Morphological paradigm as MRF
The same paradigm grid (inf. xyz), now with each cell a variable in an MRF:

             Present   Past
    1st Sg
    2nd Sg
    3rd Sg
    1st Pl
    2nd Pl
    3rd Pl
Number of observations per form (fine-grained semi-supervision). The 2nd-person forms are rare, hence undertrained. Question: does joint inference help?

             Present   Past
    inf      9,393
    1st Sg   285       1124
    2nd Sg   16        64      (rare!)
    3rd Sg   1410      1124
    1st Pl   1688      673
    2nd Pl   127       59      (rare!)
    3rd Pl   1688      673
gelten "to hold, to apply": predicted form → correct form
    *geltst → giltst
    geltet → geltet
    galtst → galtest
    *gltt → galtet

                 inf: gelten
             Present   Past
    1st Sg   gelte     galt
    2nd Sg   giltst    galtst (or: galtest)
    3rd Sg   gilt      galt
    1st Pl   gelten    galten
    2nd Pl   geltet    galtet
    3rd Pl   gelten    galten
abbrechen "to quit": predicted form → correct form
    *abbrachten → abbrachen
    abbreche → abbreche
    abbracht → abbracht
    abbrecht → abbrecht
    *atttrachst → abbrachst
    *abbrechst → abbrichst
    abbricht → abbricht
    *abbrachten → abbrachen

                 inf: abbrechen
             Present                      Past
    1st Sg   abbreche (or: breche ab)     abbrach (or: brach ab)
    2nd Sg   abbrichst (or: brichst ab)   abbrachst (or: brachst ab)
    3rd Sg   abbricht (or: bricht ab)     abbrach (or: brach ab)
    1st Pl   abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
    2nd Pl   abbrecht (or: brecht ab)     abbracht (or: bracht ab)
    3rd Pl   abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
gackern "to cackle": predicted form → correct form
    *gackrt → gackertet
    *gackart → gackertest
    gackere → gackere
    gackerst → gackerst
    gackert → gackert
    gackern → gackern
    gackerte → gackerte
    gackerten → gackerten

                 inf: gackern
             Present    Past
    1st Sg   gackere    gackerte
    2nd Sg   gackerst   gackertest
    3rd Sg   gackert    gackerte
    1st Pl   gackern    gackerten
    2nd Pl   gackert    gackertet
    3rd Pl   gackern    gackerten
werfen "to throw": predicted form → correct form
    warft → warft
    *werfst → wirfst
    werft → werft
    warfst → warfst

                 inf: werfen
             Present   Past
    1st Sg   werfe     warf
    2nd Sg   wirfst    warfst
    3rd Sg   wirft     warf
    1st Pl   werfen    warfen
    2nd Pl   werft     warft
    3rd Pl   werfen    warfen
Preliminary results: joint inference helps a lot on the rare forms, but hurts on the others. Can we fix this? (Is it because our joint decoder is approximate? Or because semi-supervised training is hard and we need a better method for it?)
Outline
- Why use joint models in NLP?
- Making big joint models tractable: approximate inference and training by loopy belief propagation
- Open questions: semi-supervised training of joint models
Key idea!
- We're using an MRF to coordinate the solutions to several NLP problems.
- Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses); or equivalently, over many fine-grained variables (individual words, tags, links).
- Within a factor, use existing fast exact NLP algorithms. These are the propagators that compute outgoing messages, even though the product of factors may be intractable, or even undecidable, to work with.
Why we need approximate inference
- MRFs are great for n-way classification (maxent).
- Also good for predicting sequences; alas, the forward-backward algorithm only allows n-gram features.
- Also good for dependency parsing; alas, our combinatorial algorithms only allow single-edge features (more interactions slow them down or introduce NP-hardness).
Great ideas in ML: message passing (adapted from MacKay (2003) textbook)
Count the soldiers: soldiers standing in a line pass messages in both directions: "1 behind you", "2 behind you", ..., "5 behind you" one way, and "1 before you", ..., "5 before you" the other.
Count the soldiers: a soldier who hears "3 behind you" and "2 before you" forms the belief that there must be 2 + 1 + 3 = 6 of us, even though he only sees his incoming messages.
Count the soldiers: likewise, a soldier who hears "4 behind you" and "1 before you" reaches the same total from his own incoming messages.
On a tree, each soldier receives reports from all branches: hearing "7 here" and "3 here" from two branches, he reports "11 here" (= 7 + 3 + 1, counting himself) down the third.
Similarly, hearing "3 here" and "3 here", a soldier reports "7 here" (= 3 + 3 + 1) down the remaining branch.
A soldier who receives reports "7 here", "3 here", and "3 here" from all of his branches forms the belief: there must be 14 of us.
Each soldier receives reports from all branches of the tree and concludes there must be 14 of us; but this wouldn't work correctly with a loopy (cyclic) graph.
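The soldier-counting scheme is easy to simulate. A minimal sketch on a made-up 5-node tree: each message summarizes one subtree, and every node's belief comes out to the same total.

```python
# Message passing to count the nodes of a tree (MacKay's soldier example).
# message(u -> v) = 1 + sum of messages into u from all neighbors except v.
# belief(v) = 1 + sum of all incoming messages = total node count.

TREE = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}  # adjacency list

def message(u, v):
    """Number of soldiers in the subtree hanging off u, away from v."""
    return 1 + sum(message(w, u) for w in TREE[u] if w != v)

def belief(v):
    """A node's belief about the total count, from incoming messages only."""
    return 1 + sum(message(u, v) for u in TREE[v])
```

On a tree the recursion terminates and every node agrees on the count; on a cyclic graph `message` would recurse forever, which is the loopiness problem the slide warns about.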
Great ideas in ML: forward-backward
In the CRF over "find preferred tags", message passing = forward-backward. The belief at a variable is the componentwise product of its incoming messages and its unary factor. E.g., at "preferred": incoming messages (v 2, n 1, a 7) and (v 3, n 1, a 6), unary factor (v 0.3, n 0, a 0.1), giving the belief (v 1.8, n 0, a 4.2). The messages themselves are computed through the binary factor table (v/n/a rows: 0 2 1 / 2 1 0 / 0 3 1), as before.
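Forward-backward fits in a few lines. A sketch using the factor tables from the earlier CRF slides (the message numbers on this slide are illustrative, so the sketch computes actual normalized marginals rather than reproducing them):

```python
# Forward-backward on the 3-word chain "find preferred tags". The belief
# (posterior marginal) at each position is proportional to
# (left message) * (unary factor) * (right message).

TAGS = ("v", "n", "a")
B = {"v": {"v": 0, "n": 2, "a": 1},     # binary factor: row = left tag
     "n": {"v": 2, "n": 1, "a": 0},
     "a": {"v": 0, "n": 3, "a": 1}}
U = [{"v": 0.3, "n": 0.02, "a": 0.0},   # find
     {"v": 0.3, "n": 0.0,  "a": 0.1},   # preferred
     {"v": 0.2, "n": 0.2,  "a": 0.0}]   # tags

n = len(U)
alpha = [dict.fromkeys(TAGS, 1.0) for _ in range(n)]  # left-to-right messages
beta = [dict.fromkeys(TAGS, 1.0) for _ in range(n)]   # right-to-left messages
for i in range(1, n):
    alpha[i] = {t: sum(alpha[i-1][s] * U[i-1][s] * B[s][t] for s in TAGS)
                for t in TAGS}
for i in range(n - 2, -1, -1):
    beta[i] = {t: sum(beta[i+1][s] * U[i+1][s] * B[t][s] for s in TAGS)
               for t in TAGS}

def belief(i):
    """Normalized posterior marginal over the tag at position i."""
    b = {t: alpha[i][t] * U[i][t] * beta[i][t] for t in TAGS}
    z = sum(b.values())
    return {t: v / z for t, v in b.items()}
```

With these tables the marginal at "preferred" puts most of its mass on "a", matching the v a n tagging that dominated the product-of-factors slide.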
Great ideas in ML: forward-backward
Extend the CRF to a skip chain to capture a non-local factor. There are now more influences on the belief at "preferred": a third incoming message (v 3, n 1, a 6) arrives over the skip edge, so the belief becomes (v 5.4, n 0, a 25.2).
Great ideas in ML: forward-backward
Extend the CRF to a skip chain to capture a non-local factor: more influences on the belief, but the graph becomes loopy. The incoming (red) messages are no longer independent; we pretend they are anyway, giving approximate beliefs (v 5.4', n 0, a 25.2').
MRF over string-valued variables!
The morphological paradigm grid again (inf. xyz; 1st/2nd/3rd Sg and Pl; Present and Past), with each cell a string-valued variable.
MRF over string-valued variables! What are these messages?
- Probability distributions over strings
- Represented by weighted FSAs
- Constructed by finite-state operations
- Parameters trainable using finite-state methods
Warning: the FSAs can get larger and larger, so they must be pruned back using k-best or variational approximations.
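To make the message idea concrete without a finite-state library, here is a much cruder stand-in: a message is an explicit map from strings to weights, "composition" with a factor is brute force over an explicit relation rather than true FSA composition, and k-best pruning keeps the map small. All strings and weights below are invented for illustration.

```python
# Messages over strings, approximated as explicit {string: weight} maps,
# with the k-best pruning that the slide warns is necessary.

from heapq import nlargest

def compose(message, factor):
    """Push a message through a string-to-string factor, given here as an
    explicit relation {(in_string, out_string): weight} (stand-in for a WFST)."""
    out = {}
    for s, w in message.items():
        for (a, b), fw in factor.items():
            if a == s:
                out[b] = out.get(b, 0.0) + w * fw
    return out

def prune(message, k):
    """Keep only the k highest-weighted strings (k-best approximation)."""
    return dict(nlargest(k, message.items(), key=lambda kv: kv[1]))

msg = {"werfen": 0.7, "warfen": 0.3}                        # belief about one cell
factor = {("werfen", "werft"): 0.6, ("werfen", "werfet"): 0.1,
          ("warfen", "warft"): 0.8}                          # made-up edit weights
out = prune(compose(msg, factor), k=2)
```

A real implementation would keep the message as a weighted FSA so it can cover infinitely many strings; the pruning step plays the same role as the k-best or variational approximation mentioned above.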
Key idea!
- We're using an MRF to coordinate the solutions to several NLP problems.
- Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses); or equivalently, over many fine-grained variables (individual words, tags, links).
- Within a factor, use existing fast exact NLP algorithms. These are the propagators that compute outgoing messages, even though the product of factors may be intractable, or even undecidable, to work with.
We just saw this for morphology; now let's see it for parsing.
Local factors in a graphical model
Back to simple variables (as in the CRF for POS tagging). Now let's do dependency parsing! O(n²) boolean variables for the possible links in "find preferred links".
Local factors in a graphical model
A possible parse of "find preferred links" is encoded as an assignment to these O(n²) boolean link variables.
Local factors in a graphical model
Another possible parse corresponds to a different true/false assignment to the link variables.
Local factors in a graphical model
Some assignments encode an illegal parse.
Local factors in a graphical model
Another illegal parse: an assignment that gives a word multiple parents.
Structured outputs?
We have fast algorithms if we only use single-edge features, and this does pretty well in supervised learning.
Edge-Factored Parsers (McDonald et al. 2005)
Sentence: Byl jasný studený dubnový den a hodiny odbíjely třináctou ("It was a bright cold day in April, and the clocks were striking thirteen"). POS tags: V A A A N J N V C.
Is jasný → den a good edge? Yes: lots of green (positively weighted) features fire on it:
- word pair: jasný → den ("bright day")
- word-to-tag: jasný → N ("bright" → NOUN)
- tag pair: A → N
- tag pair in context: A → N preceding a conjunction
Edge-Factored Parsers (McDonald et al. 2005)
How about the competing edge jasný → hodiny? Not as good: lots of red (negative or undertrained) features fire:
- word pair: jasný → hodiny ("bright clocks"), undertrained
- stem pair: jasn → hodi ("bright clock", stems only; the stemmed sentence is byl jasn stud dubn den a hodi odb třin)
- tag pair with a number mismatch between the adjective and the noun
- tag pair in context: A → N where N follows a conjunction
Message-passing from the other language
Is this a good edge? It may help to know the English translation: the edge probably aligns to some English path "N in N".
Edge-Factored Parsers (McDonald et al. 2005)
Which edge is better: "bright day" or "bright clocks"?
Score of an edge e = θ ⋅ features(e). Standard algorithms then find the valid parse with maximum total score.
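The edge score θ ⋅ features(e) is just a dot product between a weight vector and the edge's (sparse, binary) feature vector. A minimal sketch; the feature names and weights below are invented for illustration, not McDonald et al.'s actual feature set:

```python
# Edge-factored scoring: score(e) = theta . features(e), with sparse binary
# features represented as a list of feature names. Weights are invented.

THETA = {"word:jasny->den": 1.2, "tag:A->N": 0.9, "tag:A->N+conj": 0.4,
         "word:jasny->hodiny": -0.1, "agree:num-mismatch": -1.5}

def score(features):
    """Dot product of theta with a sparse binary feature vector."""
    return sum(THETA.get(f, 0.0) for f in features)  # unseen features score 0

good = score(["word:jasny->den", "tag:A->N", "tag:A->N+conj"])   # the green edge
bad = score(["word:jasny->hodiny", "agree:num-mismatch"])        # the red edge
```

With these made-up weights the "bright day" edge outscores "bright clocks", and a max-spanning-tree or dynamic-programming decoder would then pick the highest-scoring valid parse from such edge scores.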
Hard constraints on valid trees
Score of an edge e = θ ⋅ features(e); standard algorithms find the valid parse with maximum total score. The two competing edges can't both be in the parse (one parent per word). Thus, an edge may lose (or win) because of a consensus of other edges.
Note: non-projective parses
Example: "ROOT I 'll give a talk tomorrow on bootstrapping", where the subtree rooted at "talk" is a discontiguous noun phrase. The projectivity constraint: do we really want it? We can enforce this constraint throughout the tree, or drop it fully (totally different combinatorial algorithms). It would be better to do something in between, but that's NP-hard. Some languages use more crossing links than English.
Let's reclaim our freedom
The output probability is a product of local factors:
p = (1/Z) * ψ(A) * ψ(B,A) * ψ(C,A) * ψ(C,B) * ψ(D,A,B) * ψ(D,B,C) * ψ(D,A,C) * ...
Throw in any factors we want! How could we find the best parse?
- Integer linear programming (Riedel et al., 2006): doesn't give us probabilities when training or parsing.
- MCMC: slow to mix? High rejection rate because of the hard TREE constraint?
- Greedy hill-climbing (McDonald & Pereira 2006).
Let's reclaim our freedom
The output probability is a product of local factors:
p = (1/Z) * ψ(A) * ψ(B,A) * ψ(C,A) * ψ(C,B) * ψ(D,A,B) * ψ(D,B,C) * ψ(D,A,C) * ...
Throw in any factors we want! Let the local factors negotiate via belief propagation: links (and tags) reinforce or suppress one another. Each iteration takes total time O(n²) or O(n³), and the process converges to a pretty good (but approximate) global parse.
Local factors for parsing
So what factors shall we multiply to define the parse probability? Unary factors evaluate each link in isolation; as before, the goodness of a link can depend on the entire observed input context. For "find preferred links", one link might get factor values (t 2, f 1), while other links aren't as good given this input sentence (e.g., (t 1, f 2), (t 1, f 3), (t 1, f 6), (t 1, f 8)). But what if the best assignment isn't a tree??
Global factors for parsing
So what factors shall we multiply to define the parse probability?
- Unary factors evaluate each link in isolation.
- A global TREE factor requires that the links form a legal tree. This is a hard constraint: the factor is either 0 or 1. Over the 6 link variables of "find preferred links" it has 64 entries (0/1), e.g.:

    f f f f f f   0
    f f f f f t   0
    f f f f t f   0
    f f t f f t   1
    ...
    t t t t t t   0

The assignment t f f t f f: we're legal!
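The TREE factor can be tabulated exactly by brute force for a 3-word sentence. A sketch, assuming no distinguished ROOT node (a legal assignment is a directed tree in which exactly one word is left without a parent); of the 64 assignments, only the 9 rooted trees on 3 labeled nodes receive factor value 1:

```python
# Brute-force tabulation of the hard TREE factor over the 6 possible links
# among 3 words: value 1 iff the chosen links form a legal dependency tree.

from itertools import product

WORDS = [0, 1, 2]
LINKS = [(h, d) for h in WORDS for d in WORDS if h != d]  # 6 directed links

def tree_factor(assignment):
    """assignment: dict link -> bool. Returns 1 for a legal tree, else 0."""
    edges = [l for l in LINKS if assignment[l]]
    parents = {}
    for h, d in edges:
        if d in parents:                 # multiple parents: illegal
            return 0
        parents[d] = h
    roots = [w for w in WORDS if w not in parents]
    if len(roots) != 1:                  # exactly one parentless word
        return 0
    seen, frontier = {roots[0]}, [roots[0]]
    while frontier:                      # everything reachable from the root?
        h = frontier.pop()
        for hh, d in edges:
            if hh == h and d not in seen:
                seen.add(d)
                frontier.append(d)
    return 1 if seen == set(WORDS) else 0

legal = sum(tree_factor(dict(zip(LINKS, bits)))
            for bits in product([False, True], repeat=len(LINKS)))
```

The count 9 matches Cayley's formula for rooted labeled trees, n^(n-1) = 3² = 9. For real sentences the factor is of course never tabulated; the point of the later slides is that a parsing algorithm computes its messages implicitly.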
Local factors for parsing
Second-order effects: factors on two variables, e.g., grandparent. The factor rewards the case where both links are present:

         f  t
    f    1  1
    t    1  3
Second-order effects: factors on 2 variables
Grandparent
No-cross (e.g., between a link and a link that would cross it, as over "by"; example factor table from the figure: f,f -> 1; f,t -> 1; t,f -> 1; t,t -> 0.2)
Local factors for parsing ("find preferred links ... by ...")
So what factors shall we multiply to define parse probability?
Unary factors to evaluate each link in isolation
Global TREE factor to require that the links form a legal tree (hard constraint: factor is either 0 or 1)
Second-order effects: factors on 2 variables
Grandparent, no-cross, coordination with other parse & alignment, hidden POS tags, siblings, subcategorization
Exactly finding the best parse
With arbitrary features, runtime blows up
Projective parsing: O(n^3) by dynamic programming
Non-projective: O(n^2) by minimum spanning tree
But to allow fast dynamic programming or MST parsing, we can only use single-edge features
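To see why exact search blows up without these tricks, here is a deliberately naive sketch (edge scores are made up): it enumerates every head assignment, keeps only legal trees, and returns the best. There are (n+1)^n candidate assignments, which is exactly the exponential cost that dynamic programming or MST algorithms avoid for single-edge (edge-factored) features.

```python
from itertools import product

def is_tree(parents):
    # parents[i] = head of word i+1 (0 = root); legal iff no cycles.
    n = len(parents)
    for start in range(1, n + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = parents[node - 1]
    return True

def best_parse_bruteforce(score):
    """score[h][m] = score of head h -> modifier m (h, m in 0..n, m >= 1)."""
    n = len(score) - 1
    best, best_parents = float("-inf"), None
    for parents in product(range(n + 1), repeat=n):    # (n+1)^n candidates!
        if any(parents[m - 1] == m for m in range(1, n + 1)):
            continue                                   # no self-loops
        if not is_tree(list(parents)):
            continue                                   # hard TREE constraint
        total = sum(score[parents[m - 1]][m] for m in range(1, n + 1))
        if total > best:
            best, best_parents = total, list(parents)
    return best_parents, best
```

On a toy 2-word sentence with scores favoring root -> word 1 -> word 2, the brute force correctly returns that chain; the point is that it only scales to toy sentences.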
Two great tastes that taste great together
"You got dynamic programming in my belief propagation!"
"You got belief propagation in my dynamic programming!"
What does parsing have to do with belief propagation? (Loopy belief propagation!)
Loopy belief propagation for parsing ("find preferred links")
The sentence tells word 3, "Please be a verb"
Word 3 tells the 3 -> 7 link, "Sorry, then you probably don't exist"
The 3 -> 7 link tells the TREE factor, "You'll have to find another parent for 7"
The TREE factor tells the 10 -> 7 link, "You're on!"
The 10 -> 7 link tells word 10, "Could you please be a noun?"
Loopy belief propagation for parsing ("find preferred links")
Higher-order factors (e.g., grandparent) induce loops
Let's watch a loop around one triangle...
Strong links are suppressing or promoting other links
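The negotiation narrated above is just the sum-product message updates. A minimal sketch over binary link variables (f = 0, t = 1), with a made-up pairwise factor such as the grandparent factor that rewards both links being on:

```python
import numpy as np

# Pairwise factor over two binary link variables (values made up):
psi = np.array([[1.0, 1.0],
                [1.0, 3.0]])   # psi[x1][x2]

def msg_factor_to_var(psi, msg_in):
    """Sum-product message from a pairwise factor to variable x1, given the
    incoming message about x2: m(x1) = sum_x2 psi(x1, x2) * m_in(x2)."""
    return psi @ msg_in

def msg_var_to_factor(*incoming):
    """Message from a variable to a factor: pointwise product of the
    incoming messages from all OTHER neighbors, then normalize."""
    m = np.ones(2)
    for inc in incoming:
        m = m * inc
    return m / m.sum()
```

If the incoming message says the other link is probably on ([0.2, 0.8]), the grandparent factor's outgoing message promotes this link too: psi @ [0.2, 0.8] = [1.0, 2.6], i.e., "t" is now favored.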
Loopy belief propagation for parsing ("find preferred links")
Higher-order factors (e.g., grandparent) induce loops
Let's watch a loop around one triangle...
How did we compute the outgoing message to the green link?
Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?
Loopy belief propagation for parsing
Belief propagation assumes incoming messages to TREE are independent.
So outgoing messages can be computed with first-order parsing algorithms (fast, no grammar constant).
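For the non-projective case, one standard first-order tool for computing the TREE factor's beliefs is the matrix-tree theorem, which yields all edge marginals under an edge-factored distribution over spanning trees in O(n^3). This is a minimal numpy sketch under that assumption (edge weights made up); it is illustrative, not the talk's exact propagator:

```python
import numpy as np

def edge_marginals(w):
    """Marginal probability of each edge under p(tree) proportional to the
    product of its edge weights, over trees rooted at node 0.
    w[h][m] > 0 is the weight of edge h -> m (m = 1..n); w[m][m] is ignored.
    Directed matrix-tree theorem: Z = det(L) for the Laplacian L below."""
    n = len(w) - 1
    L = np.zeros((n, n))
    for m in range(1, n + 1):
        L[m - 1, m - 1] = sum(w[h][m] for h in range(n + 1) if h != m)
        for h in range(1, n + 1):
            if h != m:
                L[m - 1, h - 1] = -w[h][m]
    Linv = np.linalg.inv(L)
    mu = np.zeros((n + 1, n + 1))
    for m in range(1, n + 1):
        mu[0][m] = w[0][m] * Linv[m - 1, m - 1]      # root edges
        for h in range(1, n + 1):
            if h != m:
                mu[h][m] = w[h][m] * (Linv[m - 1, m - 1] - Linv[m - 1, h - 1])
    return mu
```

Sanity check: with n = 2 and all weights 1 there are 3 spanning trees, so the edge 0 -> 1 (present in 2 of them) gets marginal 2/3, and each modifier's incoming marginals sum to 1.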
Some interesting connections...
Parser stacking (Nivre & McDonald 2008, Martins et al. 2008)
Global constraints in arc consistency: the ALLDIFFERENT constraint (Régin 1994)
Matching constraint in max-product BP: used for computer vision (Duchi et al., 2006); could be used for machine translation
As far as we know, our parser is the first use of global constraints in sum-product BP, and nearly the first use of BP in natural language processing.
Runtimes for each factor type (see paper) add up: additive, not multiplicative, per iteration!
Runtimes for each factor type (see paper) add up: additive, not multiplicative, per iteration!
Each global factor coordinates an unbounded number of variables
Standard belief propagation would take exponential time to iterate over all configurations of those variables
See the paper for efficient propagators
Dependency accuracy: the extra, higher-order features help! (non-projective parsing)
Dependency accuracy: the extra, higher-order features help! (non-projective parsing)
(annotations on the comparison systems: "exact, slow"; "doesn't fix enough edges")
Time vs. projective search error (figure: BP search error over iterations, compared with the O(n^4) DP and with the O(n^5) DP; axis label: iterations; plot label: DP 140)
Runtime: BP vs. DP (figure: vs. the O(n^4) DP and vs. the O(n^5) DP)
Summary of MRF parsing by BP
Output probability defined as a product of local and global factors
Throw in any factors we want! (log-linear model)
Each factor must be fast, but they run independently
Let local factors negotiate via belief propagation
Each bit of syntactic structure is influenced by others
Some factors need combinatorial algorithms to compute messages fast
e.g., existing parsing algorithms using dynamic programming
Each iteration takes total time O(n^3) or even O(n^2); see paper
Compare reranking or stacking
Converges to a pretty good (but approximate) global parse
Fast parsing for formerly intractable or slow models
Extra features of these models really do help accuracy
Outline
Why use joint models in NLP?
Making big joint models tractable: approximate inference and training by loopy belief propagation
Open questions: semi-supervised training of joint models
Training with missing data is hard!
Semi-supervised learning of HMMs or PCFGs: ouch!
Merialdo: just stick with the small supervised training set
Adding unsupervised data tends to hurt
A stronger model helps (McClosky et al. 2007, Cohen et al. 2009)
So maybe some hope from good models at the factors
And from having lots of factors (i.e., take cues from lots of correlated variables at once; cf. Yarowsky et al.)
Naive Bayes would be okay: variables with unknown values can't hurt you.
They have no influence on training or decoding.
But they can't help you, either! And the independence assumptions are flaky.
So I'd like to keep discussing joint models...
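The Naive Bayes point (a variable with an unknown value has no influence) can be checked directly: marginalizing an unobserved feature out of the class posterior leaves the posterior unchanged, because the feature's conditional distribution sums to 1. A tiny sketch with made-up probabilities:

```python
# Two classes, two binary features; all probabilities are made up.
p_class = {"c0": 0.6, "c1": 0.4}
p_f1 = {"c0": 0.9, "c1": 0.2}   # P(f1 = 1 | class)
p_f2 = {"c0": 0.5, "c1": 0.7}   # P(f2 = 1 | class)

def posterior(f1, f2=None):
    """P(class | f1 [, f2]); if f2 is None, it is simply left out,
    which in Naive Bayes equals marginalizing it away."""
    scores = {}
    for c in p_class:
        s = p_class[c] * (p_f1[c] if f1 else 1 - p_f1[c])
        if f2 is not None:
            s *= p_f2[c] if f2 else 1 - p_f2[c]
        scores[c] = s
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}
```

The test below verifies the claim: mixing the two observed-f2 posteriors by P(f2 | f1) reproduces posterior(f1) exactly, so an unknown f2 neither hurts nor helps.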
Case #1: missing data that you can't impute
(variables: sentence, parse, translation, word-to-word alignment, parse of translation)
Treat like multi-task learning?
Shared features between 2 tasks: parse Chinese vs. parse Chinese with English translation
Or 3 tasks: parse Chinese with inferred English gist vs. parse Chinese with English translation vs. parse English gist derived from English (supervised)
Case #2: missing data you can impute, but maybe badly
(figure: a morphological paradigm table for an infinitive "xyz", with rows Present/Past and columns 1st/2nd/3rd person Sg/Pl)
Case #2: missing data you can impute, but maybe badly
This is where simple cases of EM go wrong
Could reduce to case #1 and throw away these variables
Or: damp messages from imputed variables to the extent you're not confident in them
Requires confidence estimation (cf. strapping)
Crude versions: confidence depends in a fixed way on time, or on the entropy of the belief at that node, or on the length of the input sentence. But we could train a confidence estimator on supervised data to pay attention to all sorts of things!
Correspondingly, scale up features for related missing-data tasks, since the damped data are partially missing
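Message damping by confidence can be sketched as tempering: raise each entry of the message to a power c in [0, 1] and renormalize. This particular functional form is an assumption for illustration, not the talk's specific recipe; c = 1 passes the message through unchanged, and c = 0 flattens it to an uninformative uniform distribution (the variable then has no influence, as in the reduce-to-case-#1 option).

```python
def damp_message(msg, confidence):
    """Temper a discrete BP message by confidence c in [0, 1]:
    raise each probability to the power c, then renormalize.
    c = 1 leaves the message unchanged; c = 0 yields uniform (no influence)."""
    tempered = [p ** confidence for p in msg]
    Z = sum(tempered)
    return [t / Z for t in tempered]
```

Intermediate confidences interpolate smoothly: a confidently wrong imputation like [0.9, 0.1] sent with c = 0.5 becomes much less assertive.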
Since you're all semi-supervised learners, I'm not going to give you all the answers.
It's helpful just to give you some problems.
I don't want to learn to do tasks, but rather learn to understand what's going on in the language I see;
happy to take help from direct or indirect supervision or other resources.
(figure: "find preferred tags" with candidate tag sets v, {v,adj}, {v,n})