Joint Models with Missing Data for Semi-Supervised Learning


  • Joint Models with Missing Data for Semi-Supervised Learning. Jason Eisner, NAACL Workshop Keynote, June 2009.

  • Outline: Why use joint models? Making big joint models tractable: approximate inference and training by loopy belief propagation. Open questions: semi-supervised training of joint models.

  • The standard story: a task is modeled by p(y|x). Semi-supervised learning: train on many (x, ?) and a few (x, y).

  • Some running examples: sentence → parse (with David A. Smith) and lemma → morphological paradigm (with Markus Dreyer), e.g., in low-resource languages. Semi-supervised learning: train on many (x, ?) and a few (x, y).

  • Semi-supervised learning: why would knowing p(x) help you learn p(y|x)? Train on many (x, ?) and a few (x, y). Share parameters via a joint model, e.g., a noisy channel: p(x,y) = p(y) · p(x|y).

    Estimate p(x,y) to have the appropriate marginal p(x); this affects the conditional distribution p(y|x).

  • [Figure: a sample of points drawn from p(x).]

  • For any x, we can now recover the cluster c that probably generated it, and a few supervised examples may let us predict y from c. E.g., if p(x,y) = Σc p(x,y,c) = Σc p(c) p(y|c) p(x|c) (a joint model!). [Figure: sample of p(x).]
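
    To make this route to semi-supervised learning concrete, here is a minimal sketch (not from the talk; the data and scikit-learn usage are illustrative assumptions) that fits a mixture p(x) = Σc p(c) p(x|c) to mostly unlabeled data, then uses the few labeled points to map each cluster c to a label y:

    # Minimal sketch: semi-supervised learning via a joint model p(x,y) = sum_c p(c) p(y|c) p(x|c).
    # Assumes numpy and scikit-learn; all names and data are illustrative only.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Lots of unlabeled x, a few labeled (x, y) pairs (two well-separated clusters).
    x_unlabeled = np.vstack([rng.normal(-2, 0.5, (500, 2)), rng.normal(+2, 0.5, (500, 2))])
    x_labeled = np.array([[-2.1, -1.9], [2.0, 2.2]])
    y_labeled = np.array([0, 1])

    # Estimate p(x) = sum_c p(c) p(x|c) from ALL the x's (the marginal constraint).
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(np.vstack([x_unlabeled, x_labeled]))

    # Use the few labeled examples to attach a label y to each cluster c.
    cluster_of_labeled = gmm.predict(x_labeled)
    label_of_cluster = {c: y for c, y in zip(cluster_of_labeled, y_labeled)}

    # Now p(y|x) comes via p(c|x): recover the likely cluster, then read off its label.
    def predict(x):
        c = gmm.predict(np.atleast_2d(x))[0]
        return label_of_cluster.get(c)

    print(predict([-1.8, -2.2]))  # most likely 0
    print(predict([2.3, 1.7]))    # most likely 1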

  • Semi-supervised learning, continued: why would knowing p(x) help you learn p(y|x)? The picture is misleading; there is no need to assume a distance metric (as in TSVM, label propagation, etc.), but we do need to choose a model family for p(x,y). Share parameters via a joint model, e.g., a noisy channel: p(x,y) = p(y) · p(x|y).

    Estimate p(x,y) to have the appropriate marginal p(x); this affects the conditional distribution p(y|x).

  • NLP + ML = ??? The task has structured input (which may be only partly observed, so infer x, too) and structured output (so decoding already needs joint inference, e.g., dynamic programming). The p(y|x) model depends on features of the (x, y) pair (sparse features?),

    or on features of (x, y, z) where z is latent (so infer z, too).

  • Each task in a vacuum?

  • Solved tasks help later ones? (e.g., a pipeline: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y)

  • Feedback? (same pipeline: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y) What if Task3 isn't solved yet and we have little training data?

  • Feedback? What if Task3 isn't solved yet and we have little training data? Impute its variables given x and y!

  • A later step benefits from many earlier ones? [pipeline diagram: Task1 through Task4 over x, z1, z2, z3, y]

  • A later step benefits from many earlier ones? And conversely?

  • We end up with a Markov Random Field (MRF): a graph over x, z1, z2, z3, y with factors 1 through 4 connecting them.

  • Variable-centric, not task-centric:
    p(x, z1, z2, z3, y) = (1/Z) · ψ1(x, z1) · ψ2(z1, z2) · ψ3(x, z1, z2, z3) · ψ4(z3, y) · ψ5(y)
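
    To make the factorization concrete, here is a minimal sketch (the variable domains and factor tables are invented for illustration) of computing a joint probability by multiplying one value from each factor, exactly as the equation above prescribes:

    # Minimal sketch of an MRF: p(assignment) = (1/Z) * product of factor values.
    from itertools import product

    variables = ["x", "z1", "z2", "z3", "y"]
    domains = {v: [0, 1] for v in variables}

    # Each factor: (the variables it touches, a function from their values to a nonnegative number).
    factors = [
        (["x", "z1"],             lambda x, z1: 2.0 if x == z1 else 0.5),
        (["z1", "z2"],            lambda z1, z2: 1.0 + z1 * z2),
        (["x", "z1", "z2", "z3"], lambda x, z1, z2, z3: 3.0 if (x + z1 + z2 + z3) % 2 == 0 else 1.0),
        (["z3", "y"],             lambda z3, y: 4.0 if z3 == y else 1.0),
        (["y"],                   lambda y: 0.7 if y == 1 else 0.3),
    ]

    def unnormalized(assignment):
        score = 1.0
        for vars_, f in factors:
            score *= f(*(assignment[v] for v in vars_))
        return score

    # Z sums the unnormalized score over every joint assignment (only feasible for tiny models).
    Z = sum(unnormalized(dict(zip(variables, vals)))
            for vals in product(*(domains[v] for v in variables)))

    a = {"x": 1, "z1": 1, "z2": 0, "z3": 1, "y": 1}
    print("p(assignment) =", unnormalized(a) / Z)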

  • First, a familiar MRF example: a Conditional Random Field (CRF) for POS tagging. The observed input sentence "find preferred tags" is shaded; one possible tagging (i.e., an assignment to the remaining variables) is v v v.

  • Familiar MRF example (CRF): another possible tagging of "find preferred tags" is v a n.

  • Familiar MRF example (CRF), "find preferred tags": a binary factor measures the compatibility of two adjacent tags. The model reuses the same parameters at each position, so the same table appears at every adjacent pair.

    Binary factor over adjacent tag pairs (rows = left tag, columns = right tag):
         v  n  a
    v    0  2  1
    n    2  1  0
    a    0  3  1

  • Familiar MRF example (CRF), "find preferred tags": a unary factor evaluates this tag, and its values depend on the corresponding word. Here the word can't be an adjective.

    Unary factor: v 0.2, n 0.2, a 0

  • Familiar MRF example (CRF): the unary factor's values depend on the corresponding word (and could be made to depend on the entire observed sentence).

    Unary factor: v 0.2, n 0.2, a 0

  • Familiar MRF example (CRF): a different unary factor sits at each position of "find preferred tags".

    find:      v 0.3, n 0.02, a 0
    preferred: v 0.3, n 0,    a 0.1
    tags:      v 0.2, n 0.2,  a 0

  • Familiar MRF example (CRF): for the tagging v a n of "find preferred tags", p(v a n) is proportional to the product of all the factors' values on v a n.

    Binary factor (left tag × right tag):
         v  n  a
    v    0  2  1
    n    2  1  0
    a    0  3  1

    Unary factors: find: v 0.3, n 0.02, a 0;  preferred: v 0.3, n 0, a 0.1;  tags: v 0.2, n 0.2, a 0

  • Familiar MRF example (CRF): p(v a n) is proportional to the product of all factors' values on v a n, i.e., ψ(v,a) · ψ(a,n) · ψfind(v) · ψpreferred(a) · ψtags(n) = 1 · 3 · 0.3 · 0.1 · 0.2 (using the factor tables above). NOTE: this is not just a pipeline of single-tag prediction tasks (which might work OK in a well-trained supervised case).
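
    A minimal sketch (factor tables copied from the slides above; everything else is mine) of scoring a tagging by multiplying the CRF's unary and binary factor values, reproducing the 1 · 3 · 0.3 · 0.1 · 0.2 product:

    # Minimal sketch: unnormalized score of a tagging under the toy CRF above.
    binary = {  # compatibility of (left tag, right tag)
        ("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
        ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
        ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1,
    }
    unary = {  # per-word tag scores
        "find":      {"v": 0.3, "n": 0.02, "a": 0.0},
        "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
        "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0},
    }

    def score(words, tags):
        s = 1.0
        for w, t in zip(words, tags):
            s *= unary[w][t]
        for t1, t2 in zip(tags, tags[1:]):
            s *= binary[(t1, t2)]
        return s

    print(score(["find", "preferred", "tags"], ["v", "a", "n"]))  # 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018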

  • Task-centered view of the world: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y

  • Variable-centered view of the world:
    p(x, z1, z2, z3, y) = (1/Z) · ψ1(x, z1) · ψ2(z1, z2) · ψ3(x, z1, z2, z3) · ψ4(z3, y) · ψ5(y)

  • Variable-centric, not task-centric

    Throw in any variables that might help! Model and exploit correlations

  • Relations we could model among N tokens: entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings, typos, formatting, entanglement, annotation.

  • Back to our (simpler!) running examples: sentence → parse (with David A. Smith), lemma → morphological paradigm (with Markus Dreyer).

  • Parser projection: sentence → parse, with little direct training data; much more training data is available elsewhere.

  • Parser projection example: "Auf diese Frage habe ich leider keine Antwort bekommen" aligned with "I did not unfortunately receive an answer to this question."


  • Parser projection: sentence → parse, with little direct training data and much more training data elsewhere. We need an interesting model.

  • Parses are not entirely isomorphic across the two sentences: link correspondences can be monotonic, null (to NULL), head-swapping, or between siblings.

  • Dependency relations (+ none of the above).

  • Parser projection variables: sentence, parse, translation, word-to-word alignment, parse of translation. Typical test data: only the sentence is observed (no translation).

  • Same variables. Small supervised training set (treebank): sentence and parse observed.

  • Same variables. Moderate treebank in the other language: translation and parse of translation observed.

  • Same variables. Maybe a few gold alignments: sentence, translation, and alignment observed.

  • Same variables. Lots of raw bitext: sentence and translation observed.

  • Same variables. Given bitext, we try to impute the other variables; now we have more constraints on the parse, which should help us train the parser. We'll see how belief propagation handles this naturally.

  • English does help us impute the Chinese parse. Example sentence: "In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement" (shown with its word-by-word Chinese gloss). Seeing the noisy output of an English WSJ parser fixes Chinese links that come out wrong without it: without the English parse, the subject attaches to an intervening noun and the complement verbs swap objects.

  • ... which does help us train a monolingual Chinese parser.

  • (Could add a 3rd language ...) Variables: sentence, parse, translation, parse of translation, alignment.

  • (Could add world knowledge ...) Variables: sentence, parse, translation, word-to-word alignment, parse of translation.

  • (Could add a bilingual dictionary ...) Variables: sentence, parse, translation, word-to-word alignment, parse of translation. Since the dictionary is incomplete, treat it as a partially observed variable.

  • Dynamic Markov Random Field over sentence, parse, translation, parse of translation, alignment. Note: these are structured variables.

    Each is expanded into a collection of fine-grained variables (words, dependency links, alignment links, ...).

    Thus the number of fine-grained variables and factors varies by example (but all examples share a single finite parameter vector).

  • Back to our running examples: sentence → parse (with David A. Smith).

  • Morphological paradigm

    A paradigm grid (here with placeholder forms): the infinitive, plus 1st/2nd/3rd person Singular and Plural, in Present and Past.

  • Morphological paradigm

    German "werfen":
              Present   Past
    inf       werfen
    1st Sg    werfe     warf
    2nd Sg    wirfst    warfst
    3rd Sg    wirft     warf
    1st Pl    werfen    warfen
    2nd Pl    werft     warft
    3rd Pl    werfen    warfen

  • Morphological paradigm as MRF

    The same paradigm grid, with its cells now treated as the variables of an MRF.

  • Number of observations per form (fine-grained semi-supervision): a per-cell count table over the paradigm grid. The infinitive is seen 9,393 times, while other cells (notably the 2nd person forms) are rare and undertrained. Question: does joint inference help?

  • gelten "to hold, to apply": form pairs shown include *geltst / giltst, geltet / geltet, galtst / galtest, *gltt / galtet.

    Paradigm of "gelten":
              Present   Past
    inf       gelten
    1st Sg    gelte     galt
    2nd Sg    giltst    galtst (or: galtest)
    3rd Sg    gilt      galt
    1st Pl    gelten    galten
    2nd Pl    geltet    galtet
    3rd Pl    gelten    galten

  • abbrechen "to quit": form pairs shown include *abbrachten / abbrachen, abbreche / abbreche, abbracht / abbracht, abbrecht / abbrecht, *atttrachst / abbrachst, *abbrechst / abbrichst, abbricht / abbricht.

    Paradigm of "abbrechen" (separable prefix, so each form also has a "... ab" variant):
              Present                      Past
    inf       abbrechen
    1st Sg    abbreche (or: breche ab)     abbrach (or: brach ab)
    2nd Sg    abbrichst (or: brichst ab)   abbrachst (or: brachst ab)
    3rd Sg    abbricht (or: bricht ab)     abbrach (or: brach ab)
    1st Pl    abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
    2nd Pl    abbrecht (or: brecht ab)     abbracht (or: bracht ab)
    3rd Pl    abbrechen (or: brechen ab)   abbrachen (or: brachen ab)

  • gackern "to cackle": form pairs shown include *gackrt / gackertet, *gackart / gackertest, and many correct pairs (gackere, gackerst, gackert, gackern, gackerte, gackerten, ...).

    Paradigm of "gackern":
              Present    Past
    inf       gackern
    1st Sg    gackere    gackerte
    2nd Sg    gackerst   gackertest
    3rd Sg    gackert    gackerte
    1st Pl    gackern    gackerten
    2nd Pl    gackert    gackertet
    3rd Pl    gackern    gackerten

  • werfen "to throw": form pairs shown include warft / warft, *werfst / wirfst, werft / werft, warfst / warfst.

    Paradigm of "werfen":
              Present   Past
    inf       werfen
    1st Sg    werfe     warf
    2nd Sg    wirfst    warfst
    3rd Sg    wirft     warf
    1st Pl    werfen    warfen
    2nd Pl    werft     warft
    3rd Pl    werfen    warfen

  • Preliminary results: joint inference helps a lot on the rare forms but hurts on the others. Can we fix this? (Is it because our joint decoder is approximate? Or because semi-supervised training is hard and we need a better method for it?)

  • Outline: Why use joint models in NLP? Making big joint models tractable: approximate inference and training by loopy belief propagation. Open questions: semi-supervised training of joint models.

  • Key idea! We're using an MRF to coordinate the solutions to several NLP problems.

    Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently, over many fine-grained variables (individual words, tags, links).

    Within a factor, use existing fast exact NLP algorithms; these are the propagators that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with.

  • MRFs are great for n-way classification (maxent), and also good for predicting sequences and for dependency parsing.

    Why we need approximate inference: alas, the forward-backward algorithm only allows n-gram features, and our combinatorial parsing algorithms only allow single-edge features (more interactions slow them down or introduce NP-hardness).

  • Great ideas in ML: message passing (adapted from MacKay's 2003 textbook). Count the soldiers in a line: each soldier passes "1 behind you", "2 behind you", "3 behind you", ... in one direction and "1 before you", "2 before you", ... in the other.

  • Count the soldiers: a soldier who hears "3 behind you" and "2 before you", and only sees those incoming messages, forms the belief: there must be 2 + 1 + 3 = 6 of us.

  • Count the soldiers: the next soldier hears "4 behind you" and "1 before you", again seeing only its incoming messages.

  • Message passing on a tree: each soldier receives reports from all branches of the tree, e.g., "7 here" and "3 here" combine into "11 here" (= 7 + 3 + 1).

  • Another node: "3 here" and "3 here" combine into "7 here" (= 3 + 3 + 1).


  • At another node, the incoming reports "7 here", "3 here", "3 here" yield the belief: there must be 14 of us.

  • The same computation at every node: each soldier receives reports from all branches of the tree and concludes "there must be 14 of us." This wouldn't work correctly with a loopy (cyclic) graph.
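
    A minimal sketch (written for this summary, not from the slides) of the soldier-counting protocol on a tree: the message a node sends toward a neighbor is 1 plus the sum of the messages arriving from its other neighbors, and a node's belief is 1 plus the sum of all its incoming messages.

    # Minimal sketch: count nodes in a tree by message passing ("soldier counting").
    from collections import defaultdict
    from functools import lru_cache

    edges = [(0, 1), (1, 2), (1, 3), (3, 4)]   # an undirected tree (illustrative)
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    @lru_cache(maxsize=None)
    def message(src, dst):
        """Number of soldiers on src's side of the src -> dst edge, including src."""
        return 1 + sum(message(other, src) for other in neighbors[src] if other != dst)

    def belief(node):
        """Each node adds up its incoming messages (plus itself) to count the whole tree."""
        return 1 + sum(message(other, node) for other in neighbors[node])

    print([belief(n) for n in neighbors])  # every node believes there are 5 soldiers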

  • Great ideas in ML: forward-backward. In the CRF for "find preferred tags", message passing = forward-backward, and a variable's belief is the pointwise product of its incoming messages. For example, the tag of "preferred" receives a message (v 2, n 1, a 7) from one side, a message (v 3, n 1, a 6) from the other, and its unary factor (v 0.3, n 0, a 0.1), giving the belief (v 1.8, n 0, a 4.2). Other messages in the chain include (v 7, n 2, a 1) and (v 3, n 6, a 1), all passed through the same binary factor table as before.
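
    Here is a minimal sketch (reusing the toy tables from the tagging example above; all function names are mine) of sum-product message passing on a chain CRF, i.e., forward-backward. Each position's belief is its unary factor times the forward and backward messages, as on the slide.

    # Minimal sketch: forward-backward as sum-product message passing on a chain CRF.
    TAGS = ["v", "n", "a"]
    binary = {("v","v"): 0, ("v","n"): 2, ("v","a"): 1,
              ("n","v"): 2, ("n","n"): 1, ("n","a"): 0,
              ("a","v"): 0, ("a","n"): 3, ("a","a"): 1}
    unary = [{"v": 0.3, "n": 0.02, "a": 0.0},   # find
             {"v": 0.3, "n": 0.0,  "a": 0.1},   # preferred
             {"v": 0.2, "n": 0.2,  "a": 0.0}]   # tags

    n = len(unary)
    ones = {t: 1.0 for t in TAGS}

    # Forward messages alpha[i]: what positions < i say about the tag at i.
    alpha = [dict(ones) for _ in range(n)]
    for i in range(1, n):
        alpha[i] = {t: sum(alpha[i-1][s] * unary[i-1][s] * binary[(s, t)] for s in TAGS)
                    for t in TAGS}

    # Backward messages beta[i]: what positions > i say about the tag at i.
    beta = [dict(ones) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = {t: sum(beta[i+1][s] * unary[i+1][s] * binary[(t, s)] for s in TAGS)
                   for t in TAGS}

    # Belief at each position = unary * forward message * backward message (unnormalized marginal).
    for i in range(n):
        belief = {t: unary[i][t] * alpha[i][t] * beta[i][t] for t in TAGS}
        print(i, belief)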

  • Great ideas in ML: forward-backward, extended. Extend the CRF to a skip chain to capture a non-local factor; the belief now has more influences. The tag of "preferred" also receives a skip-chain message (v 3, n 1, a 6), so its belief becomes (v 5.4, n 0, a 25.2), i.e., the old belief (v 1.8, n 0, a 4.2) times the new message.

  • Great ideas in ML: forward-backward, extended. The skip-chain factor captures the non-local influence, but the graph becomes loopy. The red messages are not independent? Pretend they are! The belief is still computed as the product of incoming messages (v 5.4, n 0, a 25.2), now only approximately correct.

  • MRF over string-valued variables! The morphological paradigm grid again, with each cell a string-valued variable.

  • MRF over string-valued variables: what are these messages? Probability distributions over strings, represented by weighted FSAs, constructed by finite-state operations, with parameters trainable using finite-state methods.

    Warning: the FSAs can get larger and larger; we must prune them back using k-best or variational approximations.
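
    As a toy stand-in for the weighted-FSA machinery (no finite-state library is assumed; the strings and weights are invented), a message over strings can be approximated as a weighted list of candidate strings, combined by pointwise product and pruned back to the k best so it never blows up:

    # Minimal sketch: messages as pruned weighted distributions over strings.
    # Real systems would use weighted FSAs; a k-best dict is a crude stand-in.

    def combine(msg_a, msg_b, k=3):
        """Pointwise product of two messages over strings, pruned to the k best."""
        product = {s: msg_a[s] * msg_b[s] for s in msg_a.keys() & msg_b.keys()}
        best = sorted(product.items(), key=lambda kv: kv[1], reverse=True)[:k]
        total = sum(w for _, w in best) or 1.0
        return {s: w / total for s, w in best}   # renormalize the surviving candidates

    # Two incoming messages about the same unknown paradigm cell (weights are made up).
    from_present_cell = {"wirfst": 0.6, "werfst": 0.3, "wirfest": 0.1}
    from_past_cell    = {"wirfst": 0.5, "werfst": 0.4, "warfst": 0.1}

    print(combine(from_present_cell, from_past_cell))
    # e.g. {'wirfst': 0.71..., 'werfst': 0.28...}: the two cells reinforce the correct form.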

  • Key idea! We're using an MRF to coordinate the solutions to several NLP problems.

    Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently, over many fine-grained variables (individual words, tags, links).

    Within a factor, use existing fast exact NLP algorithms; these are the propagators that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with. We just saw this for morphology; now let's see it for parsing.

  • Local factors in a graphical model. Back to simple variables: we had a CRF for POS tagging; now let's do dependency parsing of "find preferred links"! Use O(n²) boolean variables, one for each possible link.

  • A possible parse of "find preferred links" is encoded as a true/false assignment to these link variables.

  • Another possible parse: a different true/false assignment (e.g., f f t f t f) to the link variables.

  • An illegal parse: some assignments of the link variables do not encode a tree at all.

  • Another illegal parse: an assignment in which a word gets multiple parents.

  • Structured outputs? We have fast algorithms if we only use single-edge features,

    and this does pretty well in supervised learning.

  • Edge-factored parsers (McDonald et al. 2005): is this a good edge? Czech example sentence: "Byl jasný studený dubnový den a hodiny odbíjely třináctou." Yes, lots of green (positive features) ...

  • Edge-factored parsers (McDonald et al. 2005): is this a good edge? One feature: jasný → den ("bright day").

  • Is this a good edge? Features so far: jasný → den ("bright day"), jasný → N ("bright NOUN"). The sentence's POS sequence is V A A A N J N V C.

  • Is this a good edge? Features so far: jasný → den ("bright day"), jasný → N ("bright NOUN"), A → N.

  • Is this a good edge? Features so far: jasný → den, jasný → N, A → N, and A → N in the context of the preceding conjunction.

  • Edge-factored parsers (McDonald et al. 2005): how about this competing edge?

    Same sentence (POS: V A A A N J N V C): not as good, lots of red (negative features) ...

  • How about this competing edge? One feature: jasný → hodiny ("bright clocks") ... which is undertrained ...

  • How about this competing edge? Back off to stems: jasn → hodi ("bright clock", stems only).

  • Competing edge features so far: jasn → hodi (stems only), plus a singular/plural disagreement between the A and the N.

  • Competing edge features: jasn → hodi (stems only), the singular/plural disagreement, and A → N where the N follows a conjunction.

  • Message passing from the other language: is this a good edge? It may help to know the English translation; the Czech words probably align to some English path such as N → in → N.

  • Edge-factored parsers (McDonald et al. 2005): which edge is better, "bright day" or "bright clocks"?

  • Which edge is better? The score of an edge e is θ · features(e); standard algorithms then find the valid parse with maximum total score.
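
    A minimal sketch (the feature templates and weights are invented) of the edge-factored idea: each candidate edge gets a sparse feature set, its score is a dot product with the weight vector, and a parse's score is the sum of its edge scores.

    # Minimal sketch: edge-factored scoring, score(e) = theta . features(e).
    def edge_features(sent, head, child):
        """Sparse binary features for one candidate dependency edge (templates invented)."""
        h, c = sent[head], sent[child]
        return {
            f"word-pair:{h['form']}->{c['form']}",
            f"stem-pair:{h['form'][:4]}->{c['form'][:4]}",
            f"tag-pair:{h['tag']}->{c['tag']}",
            f"distance:{abs(head - child)}",
        }

    def edge_score(theta, sent, head, child):
        return sum(theta.get(f, 0.0) for f in edge_features(sent, head, child))

    def parse_score(theta, sent, heads):
        """Sum of edge scores over the tree given by heads[child] = head index."""
        return sum(edge_score(theta, sent, h, c) for c, h in heads.items())

    sent = [{"form": "Byl", "tag": "V"}, {"form": "jasný", "tag": "A"},
            {"form": "den", "tag": "N"}, {"form": "hodiny", "tag": "N"}]
    theta = {"word-pair:den->jasný": 2.0, "tag-pair:N->A": 1.5,
             "word-pair:hodiny->jasný": -0.5}

    print(edge_score(theta, sent, 2, 1))  # den -> jasný : 3.5
    print(edge_score(theta, sent, 3, 1))  # hodiny -> jasný : 1.0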

  • Hard constraints on valid trees: which edge is better? The score of an edge e is θ · features(e), and standard algorithms find the valid parse with maximum total score.

    Two conflicting edges can't both be present (one parent per word). Thus an edge may lose (or win) because of a consensus of other edges.


  • Note: non-projective parses. In "I give a talk tomorrow on bootstrapping", the subtree rooted at "talk" is a discontiguous noun phrase. The projectivity constraint: do we really want it?

    We can enforce projectivity throughout the tree, or drop it entirely (totally different combinatorial algorithms). It would be better to do something in between, but that's NP-hard. Some languages use more crossing links than English.

  • Let's reclaim our freedom: the output probability is a product of local factors, p = (1/Z) · ψ(A) · ψ(B,A) · ψ(C,A) · ψ(C,B) · ψ(D,A,B) · ψ(D,B,C) · ψ(D,A,C) · ..., so throw in any factors we want!

    How could we find the best parse? Integer linear programming (Riedel et al., 2006), but it doesn't give us probabilities when training or parsing. MCMC? Perhaps slow to mix, with a high rejection rate because of the hard TREE constraint. Greedy hill-climbing (McDonald & Pereira 2006)?

  • Let's reclaim our freedom: the output probability is a product of local factors; throw in any factors we want!

    Let the local factors negotiate via belief propagation: links (and tags) reinforce or suppress one another. Each iteration takes total time O(n²) or O(n³).

    BP converges to a pretty good (but approximate) global parse.

  • Local factors for parsing "find preferred links": what factors shall we multiply to define the parse probability? Unary factors evaluate each link in isolation; as before, the goodness of a link can depend on the entire observed input context, and some other links aren't as good given this input sentence. But what if the best assignment isn't a tree??

    Unary link factors (t/f values): t 2, f 1;  t 1, f 2;  t 1, f 2;  t 1, f 6;  t 1, f 3;  t 1, f 8.

  • Global factors for parsing: what factors shall we multiply to define the parse probability? Unary factors evaluate each link in isolation, and a global TREE factor requires the links to form a legal tree. This is a hard constraint: the factor is either 0 or 1.

    TREE factor excerpt: ffffff → 0, ffffft → 0, fffftf → 0, ..., fftfft → 1, ..., tttttt → 0.

  • Global factors for parsing, continued: the TREE factor has 64 entries (0/1), and only the assignments that encode legal trees (such as the highlighted t f f t f f) get value 1.

    Excerpt: ffffff → 0, ffffft → 0, fffftf → 0, fftfft → 1, tttttt → 0.
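
    A minimal sketch (written for this summary, not taken from the talk) of what the hard TREE factor computes: given a boolean value for every possible link, return 1 if the true links give each word exactly one parent and contain no cycles, and 0 otherwise.

    # Minimal sketch: the hard TREE factor as a 0/1 function of the link variables.
    def tree_factor(n, links):
        """links: dict {(head, child): bool} over heads 0..n (0 = ROOT) and children 1..n."""
        parents = {}
        for (head, child), on in links.items():
            if on:
                if child in parents:          # more than one parent: not a tree
                    return 0
                parents[child] = head
        if sorted(parents) != list(range(1, n + 1)):
            return 0                          # some word has no parent
        for child in parents:                 # follow parents; every word must reach ROOT
            seen, node = set(), child
            while node != 0:
                if node in seen:              # cycle
                    return 0
                seen.add(node)
                node = parents[node]
        return 1

    # ROOT -> 1 and 1 -> 2 is a tree; instead linking 1 and 2 to each other is a cycle.
    good = {(0, 1): True,  (1, 2): True, (2, 1): False}
    bad  = {(0, 1): False, (1, 2): True, (2, 1): True}
    print(tree_factor(2, good), tree_factor(2, bad))  # 1 0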

  • Local factors for parsing, continued: beyond the unary factors and the global TREE factor, add second-order effects, i.e., factors on two link variables, such as a grandparent factor.

    Grandparent factor: value 1 unless both links are present (t, t), in which case 3.

  • More second-order factors: a no-cross factor between two potentially crossing links, with value 1 unless both links are present (t, t), in which case 0.2.

  • Local factors for parsing, summary: unary factors on each link; the global TREE hard constraint (0 or 1); and second-order factors on pairs of variables: grandparent, no-cross, coordination with the other language's parse and the alignment, hidden POS tags, siblings, subcategorization.

  • Exactly finding the best parse: with arbitrary features, the runtime blows up. Projective parsing is O(n³) by dynamic programming;

    non-projective parsing is O(n²) by minimum spanning tree. But to keep dynamic programming or MST parsing fast, we may only use single-edge features.

  • Two great tastes that taste great together: "You got dynamic programming in my belief propagation!" "You got belief propagation in my dynamic programming!"

  • What does parsing have to do with belief propagation? Loopy belief propagation.

  • Loopy belief propagation for parsing "find preferred links": the sentence tells word 3, "Please be a verb." Word 3 tells the 3 → 7 link, "Sorry, then you probably don't exist." The 3 → 7 link tells the TREE factor, "You'll have to find another parent for 7." The TREE factor tells the 10 → 7 link, "You're on!" The 10 → 7 link tells word 10, "Could you please be a noun?"

  • Higher-order factors (e.g., grandparent) induce loops. Let's watch a loop around one triangle: strong links suppress or promote other links.

  • Loopy belief propagation for parsing, continued: how did we compute the outgoing message to the green link? Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?

    (TREE factor excerpt: ffffff → 0, ffffft → 0, fffftf → 0, fftfft → 1, tttttt → 0.)


  • How did we compute the outgoing message to the green link? Does the TREE factor think the green link is probably t, given the messages from all the other links?

    Belief propagation assumes the incoming messages to TREE are independent, so the outgoing messages can be computed with first-order parsing algorithms (fast, with no grammar constant).
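
    One standard way to build such a propagator for the non-projective TREE factor is the Matrix-Tree Theorem: treat the incoming messages as independent edge weights and compute every link's marginal probability under the resulting distribution over trees; the outgoing message to a link is then its belief with that link's own incoming message divided out. The sketch below (illustrative, not the talk's implementation) shows just the edge-marginal computation.

    # Minimal sketch: edge marginals under a distribution over dependency trees,
    # via the directed Matrix-Tree Theorem. w[h, c] is the weight of link h -> c
    # (h = 0 is ROOT); the example values are invented.
    import numpy as np

    def edge_marginals(w):
        """w: (n+1) x (n+1) nonnegative weights, w[h, c] for head h, child c (column 0 unused)."""
        n = w.shape[0] - 1
        L = np.zeros((n, n))                      # Laplacian with the ROOT row/column removed
        for c in range(1, n + 1):
            L[c - 1, c - 1] = sum(w[h, c] for h in range(n + 1) if h != c)
            for h in range(1, n + 1):
                if h != c:
                    L[h - 1, c - 1] = -w[h, c]
        Linv = np.linalg.inv(L)                   # log Z = log det L; d logZ / dL = Linv.T
        marg = np.zeros_like(w)
        for c in range(1, n + 1):
            marg[0, c] = w[0, c] * Linv[c - 1, c - 1]
            for h in range(1, n + 1):
                if h != c:
                    marg[h, c] = w[h, c] * (Linv[c - 1, c - 1] - Linv[c - 1, h - 1])
        return marg                               # marg[h, c] = P(link h -> c is in the tree)

    # Tiny 2-word example.
    w = np.array([[0., 1., 1.],
                  [0., 0., 2.],
                  [0., 3., 0.]])
    m = edge_marginals(w)
    print(m)
    print(m[:, 1:].sum(0))   # each column sums to 1: every word gets exactly one parent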

  • Some interesting connections: parser stacking (Nivre & McDonald 2008; Martins et al. 2008);

    global constraints in arc consistency, e.g., the ALLDIFFERENT constraint (Régin 1994);

    the matching constraint in max-product BP for computer vision (Duchi et al., 2006), which could also be used for machine translation.

    As far as we know, our parser is the first use of global constraints in sum-product BP, and nearly the first use of BP in natural language processing.

  • Runtimes for each factor type (see paper): the per-iteration runtimes of the factors are additive, not multiplicative.

  • Runtimes for each factor type (see paper): additive, not multiplicative. Each global factor coordinates an unbounded number of variables; standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.

  • Dependency accuracy: the extra, higher-order features help! (Non-projective parsing.)

  • Dependency accuracy: the extra, higher-order features help (non-projective parsing), compared with alternatives that are exact but slow or that don't fix enough edges.

  • Time vs. projective search error: plots of BP at increasing iteration counts, compared with O(n⁴) and O(n⁵) dynamic programming.

  • Runtime: BP vs. DP, against O(n⁴) and O(n⁵) dynamic programming.

  • Summary of MRF parsing by BP: the output probability is defined as a product of local and global factors; throw in any factors we want (a log-linear model). Each factor must be fast, but they run independently.

    Let the local factors negotiate via belief propagation: each bit of syntactic structure is influenced by others. Some factors need combinatorial algorithms to compute messages fast, e.g., existing parsing algorithms using dynamic programming. Each iteration takes total time O(n³) or even O(n²); see the paper. Compare reranking or stacking.

    BP converges to a pretty good (but approximate) global parse: fast parsing for formerly intractable or slow models, and the extra features of these models really do help accuracy.

  • Outline: Why use joint models in NLP? Making big joint models tractable: approximate inference and training by loopy belief propagation. Open questions: semi-supervised training of joint models.

  • Training with missing data is hard! Semi-supervised learning of HMMs or PCFGs: ouch! Merialdo: just stick with the small supervised training set, since adding unsupervised data tends to hurt. A stronger model helps (McClosky et al. 2007, Cohen et al. 2009), so maybe there is some hope from good models at the factors, and from having lots of factors (i.e., take cues from lots of correlated variables at once; cf. Yarowsky et al.). Naive Bayes would be okay: variables with unknown values can't hurt you, since they have no influence on training or decoding. But they can't help you either, and the independence assumptions are flaky. So I'd like to keep discussing joint models.

  • Case #1: missing data that you can't impute (among sentence, parse, translation, word-to-word alignment, parse of translation). Treat it like multi-task learning? Share features between two tasks: parse Chinese vs. parse Chinese with an English translation. Or three tasks: parse Chinese with an inferred English gist vs. parse Chinese with an English translation vs. parse an English gist derived from English (supervised).

  • Case #2: missing data you can impute, but maybe badly (e.g., the cells of the morphological paradigm grid).

  • Case #2, continued: missing data you can impute, but maybe badly. This is where simple cases of EM go wrong. We could reduce to case #1 and throw away these variables. Or: damp messages from imputed variables to the extent you're not confident in them. This requires confidence estimation (cf. strapping). Crude versions: confidence depends in a fixed way on time, or on the entropy of the belief at that node, or on the length of the input sentence. But we could train a confidence estimator on supervised data to pay attention to all sorts of things! Correspondingly, scale up features for related missing-data tasks, since the damped data are partially missing.
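
    A minimal sketch (entirely illustrative; both the damping scheme and the entropy-based confidence are placeholders, not the talk's proposal) of damping a message from an imputed variable toward uniform in proportion to how little we trust it:

    # Minimal sketch: damp a message from an imputed variable by an estimated confidence.
    import math

    def damp(message, confidence):
        """Interpolate toward uniform: confidence 1.0 keeps the message, 0.0 erases it."""
        uniform = 1.0 / len(message)
        damped = {v: confidence * p + (1.0 - confidence) * uniform for v, p in message.items()}
        total = sum(damped.values())
        return {v: p / total for v, p in damped.items()}

    def confidence_from_entropy(message):
        """A crude confidence estimate: peaked beliefs get trusted more (placeholder heuristic)."""
        entropy = -sum(p * math.log(p) for p in message.values() if p > 0)
        return 1.0 - entropy / math.log(len(message))

    msg_from_imputed_parse = {"v": 0.70, "n": 0.25, "a": 0.05}
    c = confidence_from_entropy(msg_from_imputed_parse)
    print(c, damp(msg_from_imputed_parse, c))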

    Since you're all semi-supervised learners, I'm not going to give you all the answers; it's helpful just to give you some problems. I don't want to learn to do tasks but rather to learn to understand what's going on in the language I see; I'm happy to take help from direct or indirect supervision or other resources.