Joint Models with Missing Data for Semi-Supervised Learning


  • Joint Models with Missing Data for Semi-Supervised Learning. Jason Eisner, NAACL Workshop Keynote, June 2009.

  • Outline: Why use joint models? Making big joint models tractable: approximate inference and training by loopy belief propagation. Open questions: semi-supervised training of joint models.

  • The standard story: a task is modeled by p(y|x). Semi-supervised learning: train on many (x, ?) and a few (x, y).

  • Some running examples: sentence → parse (with David A. Smith) and lemma → morphological paradigm (with Markus Dreyer), e.g., in low-resource languages. Semi-supervised learning: train on many (x, ?) and a few (x, y).

  • Semi-supervised learning: why would knowing p(x) help you learn p(y|x)? Train on many (x, ?) and a few (x, y). Share parameters via a joint model, e.g., a noisy channel: p(x,y) = p(y) · p(x|y).

    Estimate p(x,y) to have the appropriate marginal p(x); this affects the conditional distribution p(y|x).

  • [Figure: a sample of points drawn from p(x).]

  • For any x, we can now recover the cluster c that probably generated it, and a few supervised examples may let us predict y from c. E.g., if p(x,y) = Σc p(x,y,c) = Σc p(c) p(y|c) p(x|c) (a joint model!). [Figure: sample of p(x).]
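
    To make this route to semi-supervised learning concrete, here is a minimal sketch (not from the talk; the data and scikit-learn usage are illustrative assumptions) that fits a mixture p(x) = Σc p(c) p(x|c) to mostly unlabeled data, then uses the few labeled points to map each cluster c to a label y:

    # Minimal sketch: semi-supervised learning via a joint model p(x,y) = sum_c p(c) p(y|c) p(x|c).
    # Assumes numpy and scikit-learn; all names and data are illustrative only.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Lots of unlabeled x, a few labeled (x, y) pairs (two well-separated clusters).
    x_unlabeled = np.vstack([rng.normal(-2, 0.5, (500, 2)), rng.normal(+2, 0.5, (500, 2))])
    x_labeled = np.array([[-2.1, -1.9], [2.0, 2.2]])
    y_labeled = np.array([0, 1])

    # Estimate p(x) = sum_c p(c) p(x|c) from ALL the x's (the marginal constraint).
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(np.vstack([x_unlabeled, x_labeled]))

    # Use the few labeled examples to attach a label y to each cluster c.
    cluster_of_labeled = gmm.predict(x_labeled)
    label_of_cluster = {c: y for c, y in zip(cluster_of_labeled, y_labeled)}

    # Now p(y|x) comes via p(c|x): recover the likely cluster, then read off its label.
    def predict(x):
        c = gmm.predict(np.atleast_2d(x))[0]
        return label_of_cluster.get(c)

    print(predict([-1.8, -2.2]))  # most likely 0
    print(predict([2.3, 1.7]))    # most likely 1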

  • Semi-supervised learning, continued: why would knowing p(x) help you learn p(y|x)? The picture is misleading; there is no need to assume a distance metric (as in TSVM, label propagation, etc.), but we do need to choose a model family for p(x,y). Share parameters via a joint model, e.g., a noisy channel: p(x,y) = p(y) · p(x|y).

    Estimate p(x,y) to have the appropriate marginal p(x); this affects the conditional distribution p(y|x).

  • NLP + ML = ??? The task has structured input (which may be only partly observed, so infer x, too) and structured output (so decoding already needs joint inference, e.g., dynamic programming). The p(y|x) model depends on features of the (x, y) pair (sparse features?),

    or on features of (x, y, z) where z is latent (so infer z, too).

  • Each task in a vacuum?

  • Solved tasks help later ones? (e.g., a pipeline: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y)

  • Feedback? (same pipeline: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y) What if Task3 isn't solved yet and we have little training data?

  • Feedback? What if Task3 isn't solved yet and we have little training data? Impute its variables given x and y!

  • A later step benefits from many earlier ones? [pipeline diagram: Task1 through Task4 over x, z1, z2, z3, y]

  • A later step benefits from many earlier ones? And conversely?

  • We end up with a Markov Random Field (MRF): a graph over x, z1, z2, z3, y with factors 1 through 4 connecting them.

  • Variable-centric, not task-centric:
    p(x, z1, z2, z3, y) = (1/Z) · ψ1(x, z1) · ψ2(z1, z2) · ψ3(x, z1, z2, z3) · ψ4(z3, y) · ψ5(y)
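
    To make the factorization concrete, here is a minimal sketch (the variable domains and factor tables are invented for illustration) of computing a joint probability by multiplying one value from each factor, exactly as the equation above prescribes:

    # Minimal sketch of an MRF: p(assignment) = (1/Z) * product of factor values.
    from itertools import product

    variables = ["x", "z1", "z2", "z3", "y"]
    domains = {v: [0, 1] for v in variables}

    # Each factor: (the variables it touches, a function from their values to a nonnegative number).
    factors = [
        (["x", "z1"],             lambda x, z1: 2.0 if x == z1 else 0.5),
        (["z1", "z2"],            lambda z1, z2: 1.0 + z1 * z2),
        (["x", "z1", "z2", "z3"], lambda x, z1, z2, z3: 3.0 if (x + z1 + z2 + z3) % 2 == 0 else 1.0),
        (["z3", "y"],             lambda z3, y: 4.0 if z3 == y else 1.0),
        (["y"],                   lambda y: 0.7 if y == 1 else 0.3),
    ]

    def unnormalized(assignment):
        score = 1.0
        for vars_, f in factors:
            score *= f(*(assignment[v] for v in vars_))
        return score

    # Z sums the unnormalized score over every joint assignment (only feasible for tiny models).
    Z = sum(unnormalized(dict(zip(variables, vals)))
            for vals in product(*(domains[v] for v in variables)))

    a = {"x": 1, "z1": 1, "z2": 0, "z3": 1, "y": 1}
    print("p(assignment) =", unnormalized(a) / Z)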

  • First, a familiar MRF example: a Conditional Random Field (CRF) for POS tagging. The observed input sentence "find preferred tags" is shaded; one possible tagging (i.e., an assignment to the remaining variables) is v v v.

  • Familiar MRF example (CRF): another possible tagging of "find preferred tags" is v a n.

  • Familiar MRF example (CRF), "find preferred tags": a binary factor measures the compatibility of two adjacent tags. The model reuses the same parameters at each position, so the same table appears at every adjacent pair.

    Binary factor over adjacent tag pairs (rows = left tag, columns = right tag):
         v  n  a
    v    0  2  1
    n    2  1  0
    a    0  3  1

  • Familiar MRF example (CRF), "find preferred tags": a unary factor evaluates this tag, and its values depend on the corresponding word. Here the word can't be an adjective.

    Unary factor: v 0.2, n 0.2, a 0

  • Familiar MRF example (CRF): the unary factor's values depend on the corresponding word (and could be made to depend on the entire observed sentence).

    Unary factor: v 0.2, n 0.2, a 0

  • Familiar MRF example (CRF): a different unary factor sits at each position of "find preferred tags".

    find:      v 0.3, n 0.02, a 0
    preferred: v 0.3, n 0,    a 0.1
    tags:      v 0.2, n 0.2,  a 0

  • Familiar MRF example (CRF): for the tagging v a n of "find preferred tags", p(v a n) is proportional to the product of all the factors' values on v a n.

    Binary factor (left tag × right tag):
         v  n  a
    v    0  2  1
    n    2  1  0
    a    0  3  1

    Unary factors: find: v 0.3, n 0.02, a 0;  preferred: v 0.3, n 0, a 0.1;  tags: v 0.2, n 0.2, a 0

  • Familiar MRF example (CRF): p(v a n) is proportional to the product of all factors' values on v a n, i.e., ψ(v,a) · ψ(a,n) · ψfind(v) · ψpreferred(a) · ψtags(n) = 1 · 3 · 0.3 · 0.1 · 0.2 (using the factor tables above). NOTE: this is not just a pipeline of single-tag prediction tasks (which might work OK in a well-trained supervised case).
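
    A minimal sketch (factor tables copied from the slides above; everything else is mine) of scoring a tagging by multiplying the CRF's unary and binary factor values, reproducing the 1 · 3 · 0.3 · 0.1 · 0.2 product:

    # Minimal sketch: unnormalized score of a tagging under the toy CRF above.
    binary = {  # compatibility of (left tag, right tag)
        ("v", "v"): 0, ("v", "n"): 2, ("v", "a"): 1,
        ("n", "v"): 2, ("n", "n"): 1, ("n", "a"): 0,
        ("a", "v"): 0, ("a", "n"): 3, ("a", "a"): 1,
    }
    unary = {  # per-word tag scores
        "find":      {"v": 0.3, "n": 0.02, "a": 0.0},
        "preferred": {"v": 0.3, "n": 0.0,  "a": 0.1},
        "tags":      {"v": 0.2, "n": 0.2,  "a": 0.0},
    }

    def score(words, tags):
        s = 1.0
        for w, t in zip(words, tags):
            s *= unary[w][t]
        for t1, t2 in zip(tags, tags[1:]):
            s *= binary[(t1, t2)]
        return s

    print(score(["find", "preferred", "tags"], ["v", "a", "n"]))  # 1 * 3 * 0.3 * 0.1 * 0.2 = 0.018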

  • Task-centered view of the world: x → Task1 → z1 → Task2 → z2 → Task3 → z3 → Task4 → y

  • Variable-centered view of the world:
    p(x, z1, z2, z3, y) = (1/Z) · ψ1(x, z1) · ψ2(z1, z2) · ψ3(x, z1, z2, z3) · ψ4(z3, y) · ψ5(y)

  • Variable-centric, not task-centric

    Throw in any variables that might help! Model and exploit correlations

  • Relations we could model among N tokens: entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, speech, misspellings, typos, formatting, entanglement, annotation.

  • Back to our (simpler!) running examples: sentence → parse (with David A. Smith), lemma → morphological paradigm (with Markus Dreyer).

  • Parser projection: sentence → parse, with little direct training data; much more training data is available elsewhere.

  • Parser projection example: "Auf diese Frage habe ich leider keine Antwort bekommen" aligned with "I did not unfortunately receive an answer to this question."


  • Parser projection: sentence → parse, with little direct training data and much more training data elsewhere. We need an interesting model.

  • Parses are not entirely isomorphic across the two sentences: link correspondences can be monotonic, null (to NULL), head-swapping, or between siblings.

  • Dependency relations (+ none of the above).

  • Parser projection variables: sentence, parse, translation, word-to-word alignment, parse of translation. Typical test data: only the sentence is observed (no translation).

  • Same variables. Small supervised training set (treebank): sentence and parse observed.

  • Same variables. Moderate treebank in the other language: translation and parse of translation observed.

  • Same variables. Maybe a few gold alignments: sentence, translation, and alignment observed.

  • Same variables. Lots of raw bitext: sentence and translation observed.

  • Same variables. Given bitext, we try to impute the other variables; now we have more constraints on the parse, which should help us train the parser. We'll see how belief propagation handles this naturally.

  • English does help us impute the Chinese parse. Example sentence: "In the area of infrastructure construction, China has begun to utilize loans from international financial organizations to implement international competitive bidding procurement" (shown with its word-by-word Chinese gloss). Seeing the noisy output of an English WSJ parser fixes Chinese links that come out wrong without it: without the English parse, the subject attaches to an intervening noun and the complement verbs swap objects.

  • ... which does help us train a monolingual Chinese parser.

  • (Could add a 3rd language ...) Variables: sentence, parse, translation, parse of translation, alignment.

  • (Could add world knowledge ...) Variables: sentence, parse, translation, word-to-word alignment, parse of translation.

  • (Could add a bilingual dictionary ...) Variables: sentence, parse, translation, word-to-word alignment, parse of translation. Since the dictionary is incomplete, treat it as a partially observed variable.

  • Dynamic Markov Random Field over sentence, parse, translation, parse of translation, alignment. Note: these are structured variables.

    Each is expanded into a collection of fine-grained variables (words, dependency links, alignment links, ...).

    Thus the number of fine-grained variables and factors varies by example (but all examples share a single finite parameter vector).

  • Back to our running examples: sentence → parse (with David A. Smith).

  • Morphological paradigm

    A paradigm grid (here with placeholder forms): the infinitive, plus 1st/2nd/3rd person Singular and Plural, in Present and Past.

  • Morphological paradigm

    German "werfen":
              Present   Past
    inf       werfen
    1st Sg    werfe     warf
    2nd Sg    wirfst    warfst
    3rd Sg    wirft     warf
    1st Pl    werfen    warfen
    2nd Pl    werft     warft
    3rd Pl    werfen    warfen

  • Morphological paradigm as MRF

    The same paradigm grid, with its cells now treated as the variables of an MRF.

  • Number of observations per form (fine-grained semi-supervision): a per-cell count table over the paradigm grid. The infinitive is seen 9,393 times, while other cells (notably the 2nd person forms) are rare and undertrained. Question: does joint inference help?

  • gelten "to hold, to apply": form pairs shown include *geltst / giltst, geltet / geltet, galtst / galtest, *gltt / galtet.

    Paradigm of "gelten":
              Present   Past
    inf       gelten
    1st Sg    gelte     galt
    2nd Sg    giltst    galtst (or: galtest)
    3rd Sg    gilt      galt
    1st Pl    gelten    galten
    2nd Pl    geltet    galtet
    3rd Pl    gelten    galten

  • abbrechen "to quit": form pairs shown include *abbrachten / abbrachen, abbreche / abbreche, abbracht / abbracht, abbrecht / abbrecht, *atttrachst / abbrachst, *abbrechst / abbrichst, abbricht / abbricht.

    Paradigm of "abbrechen" (separable prefix, so each form also has a "... ab" variant):
              Present                      Past
    inf       abbrechen
    1st Sg    abbreche (or: breche ab)     abbrach (or: brach ab)
    2nd Sg    abbrichst (or: brichst ab)   abbrachst (or: brachst ab)
    3rd Sg    abbricht (or: bricht ab)     abbrach (or: brach ab)
    1st Pl    abbrechen (or: brechen ab)   abbrachen (or: brachen ab)
    2nd Pl    abbrecht (or: brecht ab)     abbracht (or: bracht ab)
    3rd Pl    abbrechen (or: brechen ab)   abbrachen (or: brachen ab)

  • gackern "to cackle": form pairs shown include *gackrt / gackertet, *gackart / gackertest, and many correct pairs (gackere, gackerst, gackert, gackern, gackerte, gackerten, ...).

    Paradigm of "gackern":
              Present    Past
    inf       gackern
    1st Sg    gackere    gackerte
    2nd Sg    gackerst   gackertest
    3rd Sg    gackert    gackerte
    1st Pl    gackern    gackerten
    2nd Pl    gackert    gackertet
    3rd Pl    gackern    gackerten

  • werfen "to throw": form pairs shown include warft / warft, *werfst / wirfst, werft / werft, warfst / warfst.

    Paradigm of "werfen":
              Present   Past
    inf       werfen
    1st Sg    werfe     warf
    2nd Sg    wirfst    warfst
    3rd Sg    wirft     warf
    1st Pl    werfen    warfen
    2nd Pl    werft     warft
    3rd Pl    werfen    warfen

  • Preliminary results: joint inference helps a lot on the rare forms but hurts on the others. Can we fix this? (Is it because our joint decoder is approximate? Or because semi-supervised training is hard and we need a better method for it?)

  • Outline: Why use joint models in NLP? Making big joint models tractable: approximate inference and training by loopy belief propagation. Open questions: semi-supervised training of joint models.

  • Key idea! We're using an MRF to coordinate the solutions to several NLP problems.

    Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently, over many fine-grained variables (individual words, tags, links).

    Within a factor, use existing fast exact NLP algorithms; these are the propagators that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with.

  • MRFs are great for n-way classification (maxent), and also good for predicting sequences and for dependency parsing.

    Why we need approximate inference: alas, the forward-backward algorithm only allows n-gram features, and our combinatorial parsing algorithms only allow single-edge features (more interactions slow them down or introduce NP-hardness).

  • Great ideas in ML: message passing (adapted from MacKay's 2003 textbook). Count the soldiers in a line: each soldier passes "1 behind you", "2 behind you", "3 behind you", ... in one direction and "1 before you", "2 before you", ... in the other.

  • Count the soldiers: a soldier who hears "3 behind you" and "2 before you", and only sees those incoming messages, forms the belief: there must be 2 + 1 + 3 = 6 of us.

  • Count the soldiers: the next soldier hears "4 behind you" and "1 before you", again seeing only its incoming messages.

  • Message passing on a tree: each soldier receives reports from all branches of the tree, e.g., "7 here" and "3 here" combine into "11 here" (= 7 + 3 + 1).

  • Another node: "3 here" and "3 here" combine into "7 here" (= 3 + 3 + 1).


  • At another node, the incoming reports "7 here", "3 here", "3 here" yield the belief: there must be 14 of us.

  • The same computation at every node: each soldier receives reports from all branches of the tree and concludes "there must be 14 of us." This wouldn't work correctly with a loopy (cyclic) graph.
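
    A minimal sketch (written for this summary, not from the slides) of the soldier-counting protocol on a tree: the message a node sends toward a neighbor is 1 plus the sum of the messages arriving from its other neighbors, and a node's belief is 1 plus the sum of all its incoming messages.

    # Minimal sketch: count nodes in a tree by message passing ("soldier counting").
    from collections import defaultdict
    from functools import lru_cache

    edges = [(0, 1), (1, 2), (1, 3), (3, 4)]   # an undirected tree (illustrative)
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    @lru_cache(maxsize=None)
    def message(src, dst):
        """Number of soldiers on src's side of the src -> dst edge, including src."""
        return 1 + sum(message(other, src) for other in neighbors[src] if other != dst)

    def belief(node):
        """Each node adds up its incoming messages (plus itself) to count the whole tree."""
        return 1 + sum(message(other, node) for other in neighbors[node])

    print([belief(n) for n in neighbors])  # every node believes there are 5 soldiers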

  • Great ideas in ML: forward-backward. In the CRF for "find preferred tags", message passing = forward-backward, and a variable's belief is the pointwise product of its incoming messages. For example, the tag of "preferred" receives a message (v 2, n 1, a 7) from one side, a message (v 3, n 1, a 6) from the other, and its unary factor (v 0.3, n 0, a 0.1), giving the belief (v 1.8, n 0, a 4.2). Other messages in the chain include (v 7, n 2, a 1) and (v 3, n 6, a 1), all passed through the same binary factor table as before.
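
    Here is a minimal sketch (reusing the toy tables from the tagging example above; all function names are mine) of sum-product message passing on a chain CRF, i.e., forward-backward. Each position's belief is its unary factor times the forward and backward messages, as on the slide.

    # Minimal sketch: forward-backward as sum-product message passing on a chain CRF.
    TAGS = ["v", "n", "a"]
    binary = {("v","v"): 0, ("v","n"): 2, ("v","a"): 1,
              ("n","v"): 2, ("n","n"): 1, ("n","a"): 0,
              ("a","v"): 0, ("a","n"): 3, ("a","a"): 1}
    unary = [{"v": 0.3, "n": 0.02, "a": 0.0},   # find
             {"v": 0.3, "n": 0.0,  "a": 0.1},   # preferred
             {"v": 0.2, "n": 0.2,  "a": 0.0}]   # tags

    n = len(unary)
    ones = {t: 1.0 for t in TAGS}

    # Forward messages alpha[i]: what positions < i say about the tag at i.
    alpha = [dict(ones) for _ in range(n)]
    for i in range(1, n):
        alpha[i] = {t: sum(alpha[i-1][s] * unary[i-1][s] * binary[(s, t)] for s in TAGS)
                    for t in TAGS}

    # Backward messages beta[i]: what positions > i say about the tag at i.
    beta = [dict(ones) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        beta[i] = {t: sum(beta[i+1][s] * unary[i+1][s] * binary[(t, s)] for s in TAGS)
                   for t in TAGS}

    # Belief at each position = unary * forward message * backward message (unnormalized marginal).
    for i in range(n):
        belief = {t: unary[i][t] * alpha[i][t] * beta[i][t] for t in TAGS}
        print(i, belief)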

  • Great ideas in ML: forward-backward, extended. Extend the CRF to a skip chain to capture a non-local factor; the belief now has more influences. The tag of "preferred" also receives a skip-chain message (v 3, n 1, a 6), so its belief becomes (v 5.4, n 0, a 25.2), i.e., the old belief (v 1.8, n 0, a 4.2) times the new message.

  • Great ideas in ML: forward-backward, extended. The skip-chain factor captures the non-local influence, but the graph becomes loopy. The red messages are not independent? Pretend they are! The belief is still computed as the product of incoming messages (v 5.4, n 0, a 25.2), now only approximately correct.

  • MRF over string-valued variables! The morphological paradigm grid again, with each cell a string-valued variable.

  • MRF over string-valued variables: what are these messages? Probability distributions over strings, represented by weighted FSAs, constructed by finite-state operations, with parameters trainable using finite-state methods.

    Warning: the FSAs can get larger and larger; we must prune them back using k-best or variational approximations.
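
    As a toy stand-in for the weighted-FSA machinery (no finite-state library is assumed; the strings and weights are invented), a message over strings can be approximated as a weighted list of candidate strings, combined by pointwise product and pruned back to the k best so it never blows up:

    # Minimal sketch: messages as pruned weighted distributions over strings.
    # Real systems would use weighted FSAs; a k-best dict is a crude stand-in.

    def combine(msg_a, msg_b, k=3):
        """Pointwise product of two messages over strings, pruned to the k best."""
        product = {s: msg_a[s] * msg_b[s] for s in msg_a.keys() & msg_b.keys()}
        best = sorted(product.items(), key=lambda kv: kv[1], reverse=True)[:k]
        total = sum(w for _, w in best) or 1.0
        return {s: w / total for s, w in best}   # renormalize the surviving candidates

    # Two incoming messages about the same unknown paradigm cell (weights are made up).
    from_present_cell = {"wirfst": 0.6, "werfst": 0.3, "wirfest": 0.1}
    from_past_cell    = {"wirfst": 0.5, "werfst": 0.4, "warfst": 0.1}

    print(combine(from_present_cell, from_past_cell))
    # e.g. {'wirfst': 0.71..., 'werfst': 0.28...}: the two cells reinforce the correct form.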

  • Key idea! We're using an MRF to coordinate the solutions to several NLP problems.

    Each factor may be a whole NLP model over one or a few complex structured variables (strings, parses), or equivalently, over many fine-grained variables (individual words, tags, links).

    Within a factor, use existing fast exact NLP algorithms; these are the propagators that compute outgoing messages, even though the product of factors may be intractable or even undecidable to work with. We just saw this for morphology; now let's see it for parsing.

  • Local factors in a graphical model. Back to simple variables: we had a CRF for POS tagging; now let's do dependency parsing of "find preferred links"! Use O(n²) boolean variables, one for each possible link.

  • A possible parse of "find preferred links" is encoded as a true/false assignment to these link variables.

  • Another possible parse: a different true/false assignment (e.g., f f t f t f) to the link variables.

  • An illegal parse: some assignments of the link variables do not encode a tree at all.

  • Another illegal parse: an assignment in which a word gets multiple parents.

  • Structured outputs? We have fast algorithms if we only use single-edge features,

    and this does pretty well in supervised learning.

  • Edge-factored parsers (McDonald et al. 2005): is this a good edge? Czech example sentence: "Byl jasný studený dubnový den a hodiny odbíjely třináctou." Yes, lots of green (positive features) ...

  • Edge-factored parsers (McDonald et al. 2005): is this a good edge? One feature: jasný → den ("bright day").

  • Is this a good edge? Features so far: jasný → den ("bright day"), jasný → N ("bright NOUN"). The sentence's POS sequence is V A A A N J N V C.

  • Is this a good edge? Features so far: jasný → den ("bright day"), jasný → N ("bright NOUN"), A → N.

  • Is this a good edge? Features so far: jasný → den, jasný → N, A → N, and A → N in the context of the preceding conjunction.

  • Edge-factored parsers (McDonald et al. 2005): how about this competing edge?

    Same sentence (POS: V A A A N J N V C): not as good, lots of red (negative features) ...

  • How about this competing edge? One feature: jasný → hodiny ("bright clocks") ... which is undertrained ...

  • How about this competing edge? Back off to stems: jasn → hodi ("bright clock", stems only).

  • Competing edge features so far: jasn → hodi (stems only), plus a singular/plural disagreement between the A and the N.

  • Competing edge features: jasn → hodi (stems only), the singular/plural disagreement, and A → N where the N follows a conjunction.

  • Message passing from the other language: is this a good edge? It may help to know the English translation; the Czech words probably align to some English path such as N → in → N.

  • Edge-factored parsers (McDonald et al. 2005): which edge is better, "bright day" or "bright clocks"?

  • Which edge is better? The score of an edge e is θ · features(e); standard algorithms then find the valid parse with maximum total score.
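
    A minimal sketch (the feature templates and weights are invented) of the edge-factored idea: each candidate edge gets a sparse feature set, its score is a dot product with the weight vector, and a parse's score is the sum of its edge scores.

    # Minimal sketch: edge-factored scoring, score(e) = theta . features(e).
    def edge_features(sent, head, child):
        """Sparse binary features for one candidate dependency edge (templates invented)."""
        h, c = sent[head], sent[child]
        return {
            f"word-pair:{h['form']}->{c['form']}",
            f"stem-pair:{h['form'][:4]}->{c['form'][:4]}",
            f"tag-pair:{h['tag']}->{c['tag']}",
            f"distance:{abs(head - child)}",
        }

    def edge_score(theta, sent, head, child):
        return sum(theta.get(f, 0.0) for f in edge_features(sent, head, child))

    def parse_score(theta, sent, heads):
        """Sum of edge scores over the tree given by heads[child] = head index."""
        return sum(edge_score(theta, sent, h, c) for c, h in heads.items())

    sent = [{"form": "Byl", "tag": "V"}, {"form": "jasný", "tag": "A"},
            {"form": "den", "tag": "N"}, {"form": "hodiny", "tag": "N"}]
    theta = {"word-pair:den->jasný": 2.0, "tag-pair:N->A": 1.5,
             "word-pair:hodiny->jasný": -0.5}

    print(edge_score(theta, sent, 2, 1))  # den -> jasný : 3.5
    print(edge_score(theta, sent, 3, 1))  # hodiny -> jasný : 1.0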

  • Hard constraints on valid trees: which edge is better? The score of an edge e is θ · features(e), and standard algorithms find the valid parse with maximum total score.

    Two conflicting edges can't both be present (one parent per word). Thus an edge may lose (or win) because of a consensus of other edges.


  • Note: non-projective parses. In "I give a talk tomorrow on bootstrapping", the subtree rooted at "talk" is a discontiguous noun phrase. The projectivity constraint: do we really want it?

    We can enforce projectivity throughout the tree, or drop it entirely (totally different combinatorial algorithms). It would be better to do something in between, but that's NP-hard. Some languages use more crossing links than English.

  • Let's reclaim our freedom: the output probability is a product of local factors, p = (1/Z) · ψ(A) · ψ(B,A) · ψ(C,A) · ψ(C,B) · ψ(D,A,B) · ψ(D,B,C) · ψ(D,A,C) · ..., so throw in any factors we want!

    How could we find the best parse? Integer linear programming (Riedel et al., 2006), but it doesn't give us probabilities when training or parsing. MCMC? Perhaps slow to mix, with a high rejection rate because of the hard TREE constraint. Greedy hill-climbing (McDonald & Pereira 2006)?

  • Let's reclaim our freedom: the output probability is a product of local factors; throw in any factors we want!

    Let the local factors negotiate via belief propagation: links (and tags) reinforce or suppress one another. Each iteration takes total time O(n²) or O(n³).

    BP converges to a pretty good (but approximate) global parse.

  • Local factors for parsing "find preferred links": what factors shall we multiply to define the parse probability? Unary factors evaluate each link in isolation; as before, the goodness of a link can depend on the entire observed input context, and some other links aren't as good given this input sentence. But what if the best assignment isn't a tree??

    Unary link factors (t/f values): t 2, f 1;  t 1, f 2;  t 1, f 2;  t 1, f 6;  t 1, f 3;  t 1, f 8.

  • Global factors for parsing: what factors shall we multiply to define the parse probability? Unary factors evaluate each link in isolation, and a global TREE factor requires the links to form a legal tree. This is a hard constraint: the factor is either 0 or 1.

    TREE factor excerpt: ffffff → 0, ffffft → 0, fffftf → 0, ..., fftfft → 1, ..., tttttt → 0.

  • Global factors for parsing, continued: the TREE factor has 64 entries (0/1), and only the assignments that encode legal trees (such as the highlighted t f f t f f) get value 1.

    Excerpt: ffffff → 0, ffffft → 0, fffftf → 0, fftfft → 1, tttttt → 0.
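
    A minimal sketch (written for this summary, not taken from the talk) of what the hard TREE factor computes: given a boolean value for every possible link, return 1 if the true links give each word exactly one parent and contain no cycles, and 0 otherwise.

    # Minimal sketch: the hard TREE factor as a 0/1 function of the link variables.
    def tree_factor(n, links):
        """links: dict {(head, child): bool} over heads 0..n (0 = ROOT) and children 1..n."""
        parents = {}
        for (head, child), on in links.items():
            if on:
                if child in parents:          # more than one parent: not a tree
                    return 0
                parents[child] = head
        if sorted(parents) != list(range(1, n + 1)):
            return 0                          # some word has no parent
        for child in parents:                 # follow parents; every word must reach ROOT
            seen, node = set(), child
            while node != 0:
                if node in seen:              # cycle
                    return 0
                seen.add(node)
                node = parents[node]
        return 1

    # ROOT -> 1 and 1 -> 2 is a tree; instead linking 1 and 2 to each other is a cycle.
    good = {(0, 1): True,  (1, 2): True, (2, 1): False}
    bad  = {(0, 1): False, (1, 2): True, (2, 1): True}
    print(tree_factor(2, good), tree_factor(2, bad))  # 1 0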

  • Local factors for parsing, continued: beyond the unary factors and the global TREE factor, add second-order effects, i.e., factors on two link variables, such as a grandparent factor.

    Grandparent factor: value 1 unless both links are present (t, t), in which case 3.

  • More second-order factors: a no-cross factor between two potentially crossing links, with value 1 unless both links are present (t, t), in which case 0.2.

  • Local factors for parsing, summary: unary factors on each link; the global TREE hard constraint (0 or 1); and second-order factors on pairs of variables: grandparent, no-cross, coordination with the other language's parse and the alignment, hidden POS tags, siblings, subcategorization.

  • Exactly finding the best parse: with arbitrary features, the runtime blows up. Projective parsing is O(n³) by dynamic programming;

    non-projective parsing is O(n²) by minimum spanning tree. But to keep dynamic programming or MST parsing fast, we may only use single-edge features.

  • Two great tastes that taste great together: "You got dynamic programming in my belief propagation!" "You got belief propagation in my dynamic programming!"

  • What does parsing have to do with belief propagation? Loopy belief propagation.

  • Loopy belief propagation for parsing "find preferred links": the sentence tells word 3, "Please be a verb." Word 3 tells the 3 → 7 link, "Sorry, then you probably don't exist." The 3 → 7 link tells the TREE factor, "You'll have to find another parent for 7." The TREE factor tells the 10 → 7 link, "You're on!" The 10 → 7 link tells word 10, "Could you please be a noun?"

  • Higher-order factors (e.g., grandparent) induce loops. Let's watch a loop around one triangle: strong links suppress or promote other links.

  • Loopy belief propagation for parsing, continued: how did we compute the outgoing message to the green link? Does the TREE factor think that the green link is probably t, given the messages it receives from all the other links?

    (TREE factor excerpt: ffffff → 0, ffffft → 0, fffftf → 0, fftfft → 1, tttttt → 0.)


  • How did we compute the outgoing message to the green link? Does the TREE factor think the green link is probably t, given the messages from all the other links?

    Belief propagation assumes the incoming messages to TREE are independent, so the outgoing messages can be computed with first-order parsing algorithms (fast, with no grammar constant).
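
    One standard way to build such a propagator for the non-projective TREE factor is the Matrix-Tree Theorem: treat the incoming messages as independent edge weights and compute every link's marginal probability under the resulting distribution over trees; the outgoing message to a link is then its belief with that link's own incoming message divided out. The sketch below (illustrative, not the talk's implementation) shows just the edge-marginal computation.

    # Minimal sketch: edge marginals under a distribution over dependency trees,
    # via the directed Matrix-Tree Theorem. w[h, c] is the weight of link h -> c
    # (h = 0 is ROOT); the example values are invented.
    import numpy as np

    def edge_marginals(w):
        """w: (n+1) x (n+1) nonnegative weights, w[h, c] for head h, child c (column 0 unused)."""
        n = w.shape[0] - 1
        L = np.zeros((n, n))                      # Laplacian with the ROOT row/column removed
        for c in range(1, n + 1):
            L[c - 1, c - 1] = sum(w[h, c] for h in range(n + 1) if h != c)
            for h in range(1, n + 1):
                if h != c:
                    L[h - 1, c - 1] = -w[h, c]
        Linv = np.linalg.inv(L)                   # log Z = log det L; d logZ / dL = Linv.T
        marg = np.zeros_like(w)
        for c in range(1, n + 1):
            marg[0, c] = w[0, c] * Linv[c - 1, c - 1]
            for h in range(1, n + 1):
                if h != c:
                    marg[h, c] = w[h, c] * (Linv[c - 1, c - 1] - Linv[c - 1, h - 1])
        return marg                               # marg[h, c] = P(link h -> c is in the tree)

    # Tiny 2-word example.
    w = np.array([[0., 1., 1.],
                  [0., 0., 2.],
                  [0., 3., 0.]])
    m = edge_marginals(w)
    print(m)
    print(m[:, 1:].sum(0))   # each column sums to 1: every word gets exactly one parent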

  • Some interesting connections: parser stacking (Nivre & McDonald 2008; Martins et al. 2008);

    global constraints in arc consistency, e.g., the ALLDIFFERENT constraint (Régin 1994);

    the matching constraint in max-product BP for computer vision (Duchi et al., 2006), which could also be used for machine translation.

    As far as we know, our parser is the first use of global constraints in sum-product BP, and nearly the first use of BP in natural language processing.

  • Runtimes for each factor type (see paper): the per-iteration runtimes of the factors are additive, not multiplicative.

  • Runtimes for each factor type (see paper): additive, not multiplicative. Each global factor coordinates an unbounded number of variables; standard belief propagation would take exponential time to iterate over all configurations of those variables. See the paper for efficient propagators.

  • Dependency accuracy: the extra, higher-order features help! (Non-projective parsing.)

  • Dependency accuracy: the extra, higher-order features help (non-projective parsing), compared with alternatives that are exact but slow or that don't fix enough edges.

  • Time vs. projective search error: plots of BP at increasing iteration counts, compared with O(n⁴) and O(n⁵) dynamic programming.

  • Runtime: BP vs. DP, against O(n⁴) and O(n⁵) dynamic programming.

  • Summary of MRF parsing by BP: the output probability is defined as a product of local and global factors; throw in any factors we want (a log-linear model). Each factor must be fast, but they run independently.

    Let the local factors negotiate via belief propagation: each bit of syntactic structure is influenced by others. Some factors need combinatorial algorithms to compute messages fast, e.g., existing parsing algorithms using dynamic programming. Each iteration takes total time O(n³) or even O(n²); see the paper. Compare reranking or stacking.

    BP converges to a pretty good (but approximate) global parse: fast parsing for formerly intractable or slow models, and the extra features of these models really do help accuracy.

  • Outline: Why use joint models in NLP? Making big joint models tractable: approximate inference and training by loopy belief propagation. Open questions: semi-supervised training of joint models.

  • Training with missing data is hard! Semi-supervised learning of HMMs or PCFGs: ouch! Merialdo: just stick with the small supervised training set, since adding unsupervised data tends to hurt. A stronger model helps (McClosky et al. 2007, Cohen et al. 2009), so maybe there is some hope from good models at the factors, and from having lots of factors (i.e., take cues from lots of correlated variables at once; cf. Yarowsky et al.). Naive Bayes would be okay: variables with unknown values can't hurt you, since they have no influence on training or decoding. But they can't help you either, and the independence assumptions are flaky. So I'd like to keep discussing joint models.

  • Case #1: missing data that you can't impute (among sentence, parse, translation, word-to-word alignment, parse of translation). Treat it like multi-task learning? Share features between two tasks: parse Chinese vs. parse Chinese with an English translation. Or three tasks: parse Chinese with an inferred English gist vs. parse Chinese with an English translation vs. parse an English gist derived from English (supervised).

  • Case #2: missing data you can impute, but maybe badly (e.g., the cells of the morphological paradigm grid).

  • Case #2, continued: missing data you can impute, but maybe badly. This is where simple cases of EM go wrong. We could reduce to case #1 and throw away these variables. Or: damp messages from imputed variables to the extent you're not confident in them. This requires confidence estimation (cf. strapping). Crude versions: confidence depends in a fixed way on time, or on the entropy of the belief at that node, or on the length of the input sentence. But we could train a confidence estimator on supervised data to pay attention to all sorts of things! Correspondingly, scale up features for related missing-data tasks, since the damped data are partially missing.
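
    A minimal sketch (entirely illustrative; both the damping scheme and the entropy-based confidence are placeholders, not the talk's proposal) of damping a message from an imputed variable toward uniform in proportion to how little we trust it:

    # Minimal sketch: damp a message from an imputed variable by an estimated confidence.
    import math

    def damp(message, confidence):
        """Interpolate toward uniform: confidence 1.0 keeps the message, 0.0 erases it."""
        uniform = 1.0 / len(message)
        damped = {v: confidence * p + (1.0 - confidence) * uniform for v, p in message.items()}
        total = sum(damped.values())
        return {v: p / total for v, p in damped.items()}

    def confidence_from_entropy(message):
        """A crude confidence estimate: peaked beliefs get trusted more (placeholder heuristic)."""
        entropy = -sum(p * math.log(p) for p in message.values() if p > 0)
        return 1.0 - entropy / math.log(len(message))

    msg_from_imputed_parse = {"v": 0.70, "n": 0.25, "a": 0.05}
    c = confidence_from_entropy(msg_from_imputed_parse)
    print(c, damp(msg_from_imputed_parse, c))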

    Since you're all semi-supervised learners, I'm not going to give you all the answers; it's helpful just to give you some problems. I don't want to learn to do tasks but rather to learn to understand what's going on in the language I see; I'm happy to take help from direct or indirect supervision or other resources.