Report on Semi-supervised Training for Statistical Parsing
Zhang Hao
2002-12-18
Brief Introduction
– Why semi-supervised training?
– Co-training framework and applications
– Can parsing fit in this framework? How?
– Conclusion
Why Semi-supervised Training
Compromise between supervised and unsupervised training. Pay-offs:
– Minimize the need for labeled data
– Maximize the value of unlabeled data
– Easy portability
Co-training Scenario
Idea: two different students learn from each other, incrementally and mutually improving.
"When two walk together, one must be able to teach me." (Chinese proverb)
Difference (motive) -> mutual learning (optimization) -> agreement (objective).
Task: optimize the objective function of agreement.
Heuristic selection is important: what to learn?
[Blum & Mitchell, 98] Co-training Assumptions
Classification problem
Feature redundancy
– Allows different views of the data
– Each view is sufficient for classification
View independence of features, given the class:
X = (X1, X2)
f(x) = f1(x1) = f2(x2) = l
Pr[X1 = x1 | X2 = x2, Y = y] = Pr[X1 = x1 | Y = y]
Pr[X2 = x2 | X1 = x1, Y = y] = Pr[X2 = x2 | Y = y]
[Blum & Mitchell, 98] Co-training example
“Course home page” classification (y/n)
Two views: content text / anchor text
(a more perfect example: the two sides of a coin)
Two naïve Bayes classifiers: should agree
[Blum & Mitchell, 98] Co-Training Algorithm
Given:
• A set L of labeled training examples
• A set U of unlabeled examples
Create a pool U' of examples by choosing u examples at random from U.
Loop for k iterations:
• Use L to train a classifier h1 that considers only the x1 portion of x
• Use L to train a classifier h2 that considers only the x2 portion of x
• Allow h1 to label p positive and n negative examples from U'
• Allow h2 to label p positive and n negative examples from U'
• Add these self-labeled examples to L
• Randomly choose 2p + 2n examples from U to replenish U'
n:p matches the ratio of negative to positive examples.
The selected examples are those “most confidently” labeled ones, i.e. heuristic selection.
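The loop above can be rendered as a minimal Python sketch. The helpers `train(view, data)` (trains a classifier on one view of the labeled data) and `predict_proba(h, x)` (returns a `(label, confidence)` pair) are hypothetical stand-ins, not part of the original algorithm statement:

```python
import random

def co_train(L, U, train, predict_proba, u=75, k=30, p=1, n=3):
    """Sketch of the Blum & Mitchell co-training loop (assumes k >= 1).

    L: list of ((x1, x2), y) labeled examples
    U: list of (x1, x2) unlabeled examples
    train(view, data): hypothetical helper training a classifier on one view
    predict_proba(h, x): hypothetical helper returning (label, confidence)
    """
    U = list(U)
    random.shuffle(U)
    pool = [U.pop() for _ in range(min(u, len(U)))]  # the pool U'

    for _ in range(k):
        h1 = train(0, L)  # sees only the x1 portion of each example
        h2 = train(1, L)  # sees only the x2 portion of each example
        for h, view in ((h1, 0), (h2, 1)):
            scored = [(predict_proba(h, x[view]), x) for x in pool]
            # heuristic selection: the most confidently labeled examples
            pos = sorted((s for s in scored if s[0][0] == 1),
                         key=lambda s: -s[0][1])[:p]
            neg = sorted((s for s in scored if s[0][0] == 0),
                         key=lambda s: -s[0][1])[:n]
            for (label, _), x in pos + neg:
                L.append((x, label))   # self-labeled example joins L
                pool.remove(x)
        while len(pool) < u and U:     # replenish U' (up to 2p + 2n items)
            pool.append(U.pop())
    return h1, h2
```

The confidence-based `pos`/`neg` selection and the `p`/`n` ratio mirror the notes on the slide above.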
Family of Algorithms Related to Co-training
Method      | Feature Split (Yes) | Feature Split (No)
Incremental | Co-training         | Self-training
Iterative   | Co-EM               | EM
[Nigam & Ghani 2000]
Parsing As Supertagging and Attaching [Sarkar 2001]
The difference between parsing and other NLP applications (WSD, WBPC, TC, NEI):
– A tree vs. a label
– Composite vs. monolithic
– Large parameter space vs. small parameter space
LTAG
– Each word is tagged with a lexicalized elementary tree (supertagging)
– Parsing is a process of substitution and adjoining of elementary trees
– The supertagger finishes a very large part of the job a traditional parser must do
A glimpse of Supertags
Two Models for Co-training
H1: selects trees based on previous context (tagging probability model)
H2: computes attachment between trees and returns the best parse (parsing probability model)
[Sarkar 2001] Co-training Algorithm
1. Input: labeled and unlabeled data
2. Update cache: randomly select sentences from unlabeled and refill the cache; if the cache is empty, exit
3. Train models H1 and H2 using labeled
4. Apply H1 and H2 to the cache
5. Pick the most probable n from H1 (run through H2) and add to labeled
6. Pick the most probable n from H2 and add to labeled
7. n = n + k; go to step 2
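A rough Python rendering of steps 1–7. The helpers `train_h1`, `train_h2`, `score_h1`, and `score_h2` are hypothetical stand-ins for the real tagger and parser, and the "run through H2" refinement of step 5 is omitted for brevity:

```python
import random

def sarkar_co_train(labeled, unlabeled, train_h1, train_h2,
                    score_h1, score_h2, cache_size=100, n=10, k=5):
    """Sketch of the cache-based co-training loop above.

    score_h*(model, sentence) is an assumed helper returning a
    (labeled_sentence, probability) pair.
    """
    unlabeled = list(unlabeled)
    while True:
        # Step 2: refill the cache with random unlabeled sentences
        random.shuffle(unlabeled)
        cache = [unlabeled.pop() for _ in range(min(cache_size, len(unlabeled)))]
        if not cache:
            break  # cache is empty: exit
        # Step 3: train both models on the current labeled set
        h1, h2 = train_h1(labeled), train_h2(labeled)
        # Steps 4-6: label the cache; keep each model's n most probable outputs
        tagged = sorted((score_h1(h1, s) for s in cache), key=lambda t: -t[1])[:n]
        parsed = sorted((score_h2(h2, s) for s in cache), key=lambda t: -t[1])[:n]
        labeled.extend(out for out, _ in tagged + parsed)
        n += k  # Step 7: grow the selection size each round
    return labeled
```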
JHU SW2002 tasks
Co-train Collins CFG parser with Sarkar LTAG parser
Co-train Rerankers
Co-train CCG supertaggers and parsers
Co-training: The Algorithm
Requires:
• Two learners with different views of the task
• Cache Manager (CM) to interface with the disparate learners
• A small set of labeled seed data and a larger pool of unlabelled data
Pseudo-Code:
– Init: Train both learners with labeled seed data
– Loop:
• CM picks unlabelled data to add to cache
• Both learners label cache
• CM selects newly labeled data to add to the learners' respective training sets
• Learners re-train
Novel Methods – Parse Selection
Want to select training examples for one parser (the student) labeled by the other (the teacher) so as to minimize noise and maximize training utility.
– Top-n: choose the n examples to which the teacher assigned the highest scores.
– Difference: choose the examples to which the teacher assigned a higher score than the student did, by some threshold.
– Intersection: choose the examples that received high scores from the teacher but low scores from the student.
– Disagreement: choose the examples for which the two parsers provided different analyses and the teacher assigned a higher score than the student.
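The four heuristics can be made concrete in a short sketch. The tuple layout (sentence, teacher parse, student parse, teacher score, student score) is an assumed representation for illustration only:

```python
def select_examples(candidates, n=10, threshold=0.1):
    """Sketch of the four parse-selection methods described above.

    candidates: list of (sentence, teacher_parse, student_parse,
                         teacher_score, student_score) tuples.
    """
    # Top-n: the n examples the teacher scored highest
    top_n = sorted(candidates, key=lambda c: -c[3])[:n]
    # Difference: teacher's score exceeds the student's by a threshold
    difference = [c for c in candidates if c[3] - c[4] > threshold]
    # Intersection: high teacher score AND low student score
    teacher_top = {id(c) for c in sorted(candidates, key=lambda c: -c[3])[:n]}
    student_low = {id(c) for c in sorted(candidates, key=lambda c: c[4])[:n]}
    intersection = [c for c in candidates
                    if id(c) in teacher_top and id(c) in student_low]
    # Disagreement: different analyses, teacher more confident
    disagreement = [c for c in candidates if c[1] != c[2] and c[3] > c[4]]
    return top_n, difference, intersection, disagreement
```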
Effect of Parse Selection
CFG-LTAG Co-training
Re-rankers Co-training
What is Re-ranking?
– A re-ranker reorders the output of an n-best (probabilistic) parser based on features of the parse
– While parsers use local features to make decisions, re-rankers use features that can span the entire tree
– Instead of co-training parsers, co-train different re-rankers
Re-rankers Co-training
Motivation: Why re-rankers?
– Speed
• Parse the data once
• Re-rank it many times
– Objective function
• The lower runtime of re-rankers allows us to explicitly maximize agreement between parses
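As a sketch, the agreement objective could be measured as the fraction of sentences on which the two re-rankers select the same parse from the n-best list; the re-ranker functions here are assumed stand-ins, not the workshop's models:

```python
def agreement(rerank1, rerank2, nbest_lists):
    """Fraction of sentences where both re-rankers pick the same parse.

    rerank1/rerank2: assumed functions mapping an n-best list of
    candidate parses to the single parse they rank first.
    """
    same = sum(1 for nbest in nbest_lists if rerank1(nbest) == rerank2(nbest))
    return same / len(nbest_lists)
```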
Re-rankers Co-training
Motivation: Why re-rankers?
– Accuracy
• Re-rankers can improve the performance of existing parsers
• Collins ’00 cites a 13 percent reduction in error rate from re-ranking
– Task closer to classification
• A re-ranker can be seen as a binary classifier: either a parse is the best for a sentence or it isn’t
• This is the original domain co-training was intended for
Re-rankers Co-training
Experimental, but much remains to be explored. Remember: a re-ranker is easier to develop.
– Reranker 1: log-linear model
– Reranker 2: linear perceptron model
– Room for improvement:
• Current best parser: 89.7
• Oracle that picks the best parse from the top 50: 95+
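A minimal sketch of the linear perceptron re-ranker idea (Reranker 2), assuming each candidate parse is represented as a feature vector; this is an illustration under that assumption, not the workshop's actual model:

```python
def train_perceptron_reranker(train_data, n_features, epochs=5):
    """Train weights w so the gold parse scores highest in each n-best list.

    train_data: list of (candidates, gold_index), where candidates is a
    list of feature vectors, one per parse -- an assumed representation.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for candidates, gold in train_data:
            scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in candidates]
            guess = max(range(len(candidates)), key=scores.__getitem__)
            if guess != gold:
                # standard perceptron update toward the gold parse's features
                for i in range(n_features):
                    w[i] += candidates[gold][i] - candidates[guess][i]
    return w
```

Because re-rankers see whole-tree feature vectors rather than local parsing decisions, an update like this can use any feature that spans the entire tree, as the slides note.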
JHU SW2002 Conclusion
– Largest experimental study to date on the use of unlabelled data for improving parser performance.
– Co-training enhances performance for parsers and taggers trained on small (500—10,000 sentences) amounts of labeled data.
– Co-training can be used for porting parsers trained on one genre to parse on another without any new human-labeled data at all, improving on state-of-the-art for this task.
– Even tiny amounts of human-labelled data for the target genre enhance porting via co-training.
– New methods for Parse Selection have been developed, and play a crucial role.
How to Improve Our Parser?
Similar setting: limited labeled data (Penn CTB), a large amount of unlabeled data from a somewhat different domain (PKU People's Daily)
To try:
– Re-rankers' development cycle is much shorter, so they are worth trying. Many ML techniques may be utilized.
– Re-rankers' agreement is still an open question
Thanks