Report on Semi-supervised Training for Statistical Parsing
Zhang Hao
2002-12-18
Brief Introduction
– Why semi-supervised training?
– Co-training framework and applications
– Can parsing fit in this framework? How?
– Conclusion
Why Semi-supervised Training
Compromise between supervised and unsupervised training. Pay-offs:
– Minimize the need for labeled data
– Maximize the value of unlabeled data
– Easy portability
Co-training Scenario
Idea: two different students learn from each other, incrementally and mutually improving.
"When two walk together, one must be able to teach me." (Chinese proverb)
Difference (motive) -> mutual learning (optimization) -> agreement (objective).
Task: optimize the objective function of agreement.
Heuristic selection is important: what to learn?
[Blum & Mitchell, 98] Co-training Assumptions
Classification problem
Feature redundancy
– Allows different views of the data
– Each view is sufficient for classification
View independence of features, given the class:
X = (X1, X2)
f(x) = f1(x1) = f2(x2) = l
Pr[X1 = x1 | X2 = x2, Y = y] = Pr[X1 = x1 | Y = y]
Pr[X2 = x2 | X1 = x1, Y = y] = Pr[X2 = x2 | Y = y]
[Blum & Mitchell, 98] Co-training example
“Course home page” classification (y/n)
Two views: content text / anchor text
(a more perfect example: the two sides of a coin)
Two naïve Bayes classifiers: should agree
[Blum & Mitchell, 98] Co-Training Algorithm
Given:
• A set L of labeled training examples
• A set U of unlabeled examples
Create a pool U' of examples by choosing u examples at random from U.
Loop for k iterations:
• Use L to train a classifier h1 that considers only the x1 portion of x
• Use L to train a classifier h2 that considers only the x2 portion of x
• Allow h1 to label p positive and n negative examples from U'
• Allow h2 to label p positive and n negative examples from U'
• Add these self-labeled examples to L
• Randomly choose 2p + 2n examples from U to replenish U'
n:p matches the ratio of negative to positive examples.
The selected examples are those “most confidently” labeled ones, i.e. heuristic selection.
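The loop above can be rendered as a minimal Python sketch. The helpers `train(view, data)` (trains a classifier on one view of the labeled data) and `predict_proba(h, x)` (returns a `(label, confidence)` pair) are hypothetical stand-ins, not part of the original algorithm statement:

```python
import random

def co_train(L, U, train, predict_proba, u=75, k=30, p=1, n=3):
    """Sketch of the Blum & Mitchell co-training loop (assumes k >= 1).

    L: list of ((x1, x2), y) labeled examples
    U: list of (x1, x2) unlabeled examples
    train(view, data): hypothetical helper training a classifier on one view
    predict_proba(h, x): hypothetical helper returning (label, confidence)
    """
    U = list(U)
    random.shuffle(U)
    pool = [U.pop() for _ in range(min(u, len(U)))]  # the pool U'

    for _ in range(k):
        h1 = train(0, L)  # sees only the x1 portion of each example
        h2 = train(1, L)  # sees only the x2 portion of each example
        for h, view in ((h1, 0), (h2, 1)):
            scored = [(predict_proba(h, x[view]), x) for x in pool]
            # heuristic selection: the most confidently labeled examples
            pos = sorted((s for s in scored if s[0][0] == 1),
                         key=lambda s: -s[0][1])[:p]
            neg = sorted((s for s in scored if s[0][0] == 0),
                         key=lambda s: -s[0][1])[:n]
            for (label, _), x in pos + neg:
                L.append((x, label))   # self-labeled example joins L
                pool.remove(x)
        while len(pool) < u and U:     # replenish U' (up to 2p + 2n items)
            pool.append(U.pop())
    return h1, h2
```

The confidence-based `pos`/`neg` selection and the `p`/`n` ratio mirror the notes on the slide above.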
Family of Algorithms Related to Co-training
Method      | Feature Split (Yes) | Feature Split (No)
Incremental | Co-training         | Self-training
Iterative   | Co-EM               | EM
[Nigam & Ghani 2000]
Parsing As Supertagging and Attaching [Sarkar 2001]
The difference between parsing and other NLP applications (WSD, WBPC, TC, NEI):
– A tree vs. a label
– Composite vs. monolithic
– Large parameter space vs. small parameter space
LTAG
– Each word is tagged with a lexicalized elementary tree (supertagging)
– Parsing is a process of substitution and adjoining of elementary trees
– The supertagger finishes a very large part of the job a traditional parser must do
A glimpse of Supertags
Two Models for Co-training
H1: selects trees based on previous context (tagging probability model)
H2: computes attachment between trees and returns the best parse (parsing probability model)
[Sarkar 2001] Co-training Algorithm
1. Input: labeled and unlabeled data
2. Update cache: randomly select sentences from unlabeled and refill the cache; if the cache is empty, exit
3. Train models H1 and H2 using labeled
4. Apply H1 and H2 to the cache
5. Pick the most probable n from H1 (run through H2) and add to labeled
6. Pick the most probable n from H2 and add to labeled
7. n = n + k; go to step 2
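A rough Python rendering of steps 1–7. The helpers `train_h1`, `train_h2`, `score_h1`, and `score_h2` are hypothetical stand-ins for the real tagger and parser, and the "run through H2" refinement of step 5 is omitted for brevity:

```python
import random

def sarkar_co_train(labeled, unlabeled, train_h1, train_h2,
                    score_h1, score_h2, cache_size=100, n=10, k=5):
    """Sketch of the cache-based co-training loop above.

    score_h*(model, sentence) is an assumed helper returning a
    (labeled_sentence, probability) pair.
    """
    unlabeled = list(unlabeled)
    while True:
        # Step 2: refill the cache with random unlabeled sentences
        random.shuffle(unlabeled)
        cache = [unlabeled.pop() for _ in range(min(cache_size, len(unlabeled)))]
        if not cache:
            break  # cache is empty: exit
        # Step 3: train both models on the current labeled set
        h1, h2 = train_h1(labeled), train_h2(labeled)
        # Steps 4-6: label the cache; keep each model's n most probable outputs
        tagged = sorted((score_h1(h1, s) for s in cache), key=lambda t: -t[1])[:n]
        parsed = sorted((score_h2(h2, s) for s in cache), key=lambda t: -t[1])[:n]
        labeled.extend(out for out, _ in tagged + parsed)
        n += k  # Step 7: grow the selection size each round
    return labeled
```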
JHU SW2002 tasks
Co-train Collins CFG parser with Sarkar LTAG parser
Co-train Rerankers
Co-train CCG supertaggers and parsers
Co-training: The Algorithm
Requires:
• Two learners with different views of the task
• Cache Manager (CM) to interface with the disparate learners
• A small set of labeled seed data and a larger pool of unlabelled data
Pseudo-Code:
– Init: Train both learners with labeled seed data
– Loop:
• CM picks unlabelled data to add to cache
• Both learners label cache
• CM selects newly labeled data to add to the learners' respective training sets
• Learners re-train
Novel Methods – Parse Selection
Want to select training examples for one parser (the student) labeled by the other (the teacher) so as to minimize noise and maximize training utility.
– Top-n: choose the n examples to which the teacher assigned the highest scores.
– Difference: choose the examples to which the teacher assigned a higher score than the student did, by some threshold.
– Intersection: choose the examples that received high scores from the teacher but low scores from the student.
– Disagreement: choose the examples for which the two parsers provided different analyses and the teacher assigned a higher score than the student.
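The four heuristics can be made concrete in a short sketch. The tuple layout (sentence, teacher parse, student parse, teacher score, student score) is an assumed representation for illustration only:

```python
def select_examples(candidates, n=10, threshold=0.1):
    """Sketch of the four parse-selection methods described above.

    candidates: list of (sentence, teacher_parse, student_parse,
                         teacher_score, student_score) tuples.
    """
    # Top-n: the n examples the teacher scored highest
    top_n = sorted(candidates, key=lambda c: -c[3])[:n]
    # Difference: teacher's score exceeds the student's by a threshold
    difference = [c for c in candidates if c[3] - c[4] > threshold]
    # Intersection: high teacher score AND low student score
    teacher_top = {id(c) for c in sorted(candidates, key=lambda c: -c[3])[:n]}
    student_low = {id(c) for c in sorted(candidates, key=lambda c: c[4])[:n]}
    intersection = [c for c in candidates
                    if id(c) in teacher_top and id(c) in student_low]
    # Disagreement: different analyses, teacher more confident
    disagreement = [c for c in candidates if c[1] != c[2] and c[3] > c[4]]
    return top_n, difference, intersection, disagreement
```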
Effect of Parse Selection
CFG-LTAG Co-training
Re-rankers Co-training
What is Re-ranking?
– A re-ranker reorders the output of an n-best (probabilistic) parser based on features of the parse
– While parsers use local features to make decisions, re-rankers use features that can span the entire tree
– Instead of co-training parsers, co-train different re-rankers
Re-rankers Co-training
Motivation: Why re-rankers?
– Speed
• Parse the data once
• Re-rank it many times
– Objective function
• The lower runtime of re-rankers allows us to explicitly maximize agreement between parses
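As a sketch, the agreement objective could be measured as the fraction of sentences on which the two re-rankers select the same parse from the n-best list; the re-ranker functions here are assumed stand-ins, not the workshop's models:

```python
def agreement(rerank1, rerank2, nbest_lists):
    """Fraction of sentences where both re-rankers pick the same parse.

    rerank1/rerank2: assumed functions mapping an n-best list of
    candidate parses to the single parse they rank first.
    """
    same = sum(1 for nbest in nbest_lists if rerank1(nbest) == rerank2(nbest))
    return same / len(nbest_lists)
```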
Re-rankers Co-training
Motivation: Why re-rankers?
– Accuracy
• Re-rankers can improve the performance of existing parsers
• Collins ’00 cites a 13 percent reduction in error rate from re-ranking
– Task closer to classification
• A re-ranker can be seen as a binary classifier: either a parse is the best for a sentence or it isn’t
• This is the original domain co-training was intended for
Re-rankers Co-training
Experimental, but much remains to be explored. Remember: a re-ranker is easier to develop.
– Reranker 1: log-linear model
– Reranker 2: linear perceptron model
– Room for improvement:
• Current best parser: 89.7
• Oracle that picks the best parse from the top 50: 95+
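A minimal sketch of the linear perceptron re-ranker idea (Reranker 2), assuming each candidate parse is represented as a feature vector; this is an illustration under that assumption, not the workshop's actual model:

```python
def train_perceptron_reranker(train_data, n_features, epochs=5):
    """Train weights w so the gold parse scores highest in each n-best list.

    train_data: list of (candidates, gold_index), where candidates is a
    list of feature vectors, one per parse -- an assumed representation.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for candidates, gold in train_data:
            scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in candidates]
            guess = max(range(len(candidates)), key=scores.__getitem__)
            if guess != gold:
                # standard perceptron update toward the gold parse's features
                for i in range(n_features):
                    w[i] += candidates[gold][i] - candidates[guess][i]
    return w
```

Because re-rankers see whole-tree feature vectors rather than local parsing decisions, an update like this can use any feature that spans the entire tree, as the slides note.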
JHU SW2002 Conclusion
– Largest experimental study to date on the use of unlabelled data for improving parser performance.
– Co-training enhances performance for parsers and taggers trained on small (500—10,000 sentences) amounts of labeled data.
– Co-training can be used for porting parsers trained on one genre to parse on another without any new human-labeled data at all, improving on state-of-the-art for this task.
– Even tiny amounts of human-labelled data for the target genre enhance porting via co-training.
– New methods for Parse Selection have been developed, and play a crucial role.
How to Improve Our Parser?
Similar setting: limited labeled data (Penn CTB), a large amount of unlabeled data from a somewhat different domain (PKU People's Daily)
To try:
– Re-rankers' development cycle is much shorter, so they are worth trying. Many ML techniques may be utilized.
– Re-rankers' agreement is still an open question
Thanks