Semi-Supervised Learning & Summary
Advanced Statistical Methods in NLP, Ling 572
March 8, 2012
Roadmap
Semi-supervised learning:
- Motivation & perspective
- Yarowsky's model
- Co-training
Summary
Semi-supervised Learning
Motivation
Supervised learning:
- Works really well, but needs lots of labeled training data
Unsupervised learning:
- No labeled data required, but may not work well and may not learn the desired distinctions
- E.g., unsupervised parsing techniques: they fit the data, but don't correspond to linguistic intuition
Solution
Semi-supervised learning. General idea:
- Use a small amount of labeled training data
- Augment it with a large amount of unlabeled training data
- Use information in the unlabeled data to improve models
Many different semi-supervised machine learners:
- Variants of supervised techniques: semi-supervised SVMs, CRFs, etc.
- Bootstrapping approaches: Yarowsky's method, self-training, co-training
Label the First Use of "Plant"
Biological example: "There are more kinds of plants and animals in the rainforests than anywhere else on Earth. Over half of the millions of known species of plants and animals live in the rainforest. Many are found nowhere else. There are even plants and animals in the rainforest that we have not yet discovered."
Industrial example: "The Paulus company was founded in 1938. Since those days the product range has been the subject of constant expansions and is brought up continuously to correspond with the state of the art. We're engineering, manufacturing and commissioning world-wide ready-to-run plants packed with our comprehensive know-how. Our Product Range includes pneumatic conveying systems for carbon, carbide, sand, lime and many others. We use reagent injection in molten metal for the…"
Word Sense Disambiguation
An application of lexical semantics.
Goal: given a word in context, identify the appropriate sense.
- E.g., "plants and animals in the rainforest"
Crucial for real syntactic & semantic analysis: the correct sense can determine
- the available syntactic structure
- the available thematic roles, the correct meaning, ...
Disambiguation Features
Key: What are the features?
- Part of speech: of the word and its neighbors
- Morphologically simplified form
- Words in the neighborhood
  - Question: How big a neighborhood? Is there a single optimal size? Why?
- (Possibly shallow) syntactic analysis: e.g., predicate-argument relations, modification, phrases
- Collocation vs. co-occurrence features:
  - Collocation: words in a specific relation (e.g., predicate-argument), 1 word +/-
  - Co-occurrence: bag of words
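The collocation vs. co-occurrence distinction can be made concrete. A minimal sketch (my own illustration, not code from the course; function names are mine): collocational features record the word at each fixed offset from the target, while co-occurrence features are a position-free bag of words from a window.

```python
# Two kinds of WSD features for a target token at index i.

def collocation_features(tokens, i, k=1):
    """Words at fixed offsets from the target: position matters."""
    feats = {}
    for off in range(-k, k + 1):
        if off == 0:
            continue
        j = i + off
        word = tokens[j] if 0 <= j < len(tokens) else "<pad>"
        feats[f"w[{off:+d}]={word}"] = 1
    return feats

def cooccurrence_features(tokens, i, window=3):
    """Bag of words within +/-window of the target: positions discarded."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return {f"bow={tokens[j]}": 1 for j in range(lo, hi) if j != i}

sent = "plants and animals in the rainforest".split()
print(collocation_features(sent, 0))   # target word: "plants"
print(cooccurrence_features(sent, 0))
```

For the target "plants", the collocational view keeps only "the word at +1 is 'and'", while the co-occurrence view keeps {and, animals, in} without positions.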
WSD Evaluation
Ideally, an end-to-end evaluation with the WSD component:
- Demonstrates the real impact of the technique in a system
- Difficult, expensive, and still application-specific
Typically, an intrinsic, sense-based evaluation:
- Accuracy, precision, recall
- SENSEVAL/SEMEVAL: all-words and lexical-sample tasks
Baseline: most frequent sense
Topline: human inter-rater agreement: 75-80% on fine-grained senses; 90% on coarse-grained senses
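The intrinsic, sense-based metrics above are straightforward to compute. A small illustrative sketch (the toy data and function names are mine, not from the slides):

```python
# Sense-based WSD scoring: overall accuracy plus per-sense precision/recall.

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall(gold, pred, sense):
    tp = sum(g == sense and p == sense for g, p in zip(gold, pred))
    fp = sum(g != sense and p == sense for g, p in zip(gold, pred))
    fn = sum(g == sense and p != sense for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

gold = ["bio", "bio", "ind", "bio", "ind"]
pred = ["bio", "ind", "ind", "bio", "ind"]
print(accuracy(gold, pred))                 # 0.8
print(precision_recall(gold, pred, "bio"))  # (1.0, 0.666...)
```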
Minimally Supervised WSD
Yarowsky's algorithm (1995)
Bootstrapping approach: use a small labeled seed set to iteratively train.
Builds on two key insights:
- One sense per discourse: a word appearing multiple times in a text has the same sense
  - In a corpus of 37,232 "bass" instances: always a single sense
- One sense per collocation: local phrases select a single sense
  - "fish" -> Bass1
  - "play" -> Bass2
Yarowsky's Algorithm: Training Decision Lists
1. Pick seed instances & tag them
2. Find collocations: word left, word right, word +/-K
   (A) Calculate informativeness on the tagged set, and order the rules by it
   (B) Tag new instances with the rules
   (C) Apply one sense per discourse
   (D) If instances are still unlabeled, go to 2
3. Apply one sense per discourse
Disambiguation: first rule matched
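One round of the loop above can be sketched as follows. This is my own hedged reconstruction, not Yarowsky's code: informativeness is computed as a smoothed log-likelihood ratio between the two senses given a collocation (the smoothing constant and data structures are my choices), rules are ordered by its magnitude, and disambiguation returns the first matching rule.

```python
import math
from collections import defaultdict

def train_decision_list(labeled, alpha=0.1):
    """labeled: list of (features:set, sense in {1, 2}).
    Returns rules ordered by informativeness (|log-likelihood ratio|)."""
    counts = defaultdict(lambda: [0, 0])      # feature -> [n_sense1, n_sense2]
    for feats, sense in labeled:
        for f in feats:
            counts[f][sense - 1] += 1
    rules = []
    for f, (n1, n2) in counts.items():
        llr = math.log((n1 + alpha) / (n2 + alpha))   # smoothed LLR
        rules.append((abs(llr), f, 1 if llr > 0 else 2))
    rules.sort(reverse=True)                  # most informative first
    return rules

def classify(rules, feats):
    for _, f, sense in rules:                 # first rule matched wins
        if f in feats:
            return sense
    return None                               # stays unlabeled this round

seed = [({"fish", "caught"}, 1), ({"play", "guitar"}, 2), ({"fish"}, 1)]
rules = train_decision_list(seed)
print(classify(rules, {"fish", "river"}))     # -> 1
print(classify(rules, {"play"}))              # -> 2
```

Instances that match no rule are left unlabeled, to be picked up on a later iteration once new collocations enter the list.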
Yarowsky Decision List
Iterative Updating
Label the First Use of "Plant" (the biological and industrial "plant" examples are repeated here)
Sense Choice with Collocational Decision Lists
Create an initial decision list, with rules ordered by informativeness.
Check nearby word groups (collocations):
- Biology: "animal" within +/-2-10 words
- Industry: "manufacturing" within +/-2-10 words
Result: correct selection; 95% on pairwise tasks
Self-Training
Basic approach:
- Start with a small labeled training set
- Train a supervised classifier on the training set
- Apply the new classifier to the residual unlabeled training data
- Add the 'best' newly labeled examples to the labeled training set
- Iterate
Self-Training
Simple, right?
The devil is in the details: which instances are 'best' to add?
- Highest confidence? Probably accurate, but probably adds little new information to the classifier.
- Most different? Probably adds information, but may not be accurate.
- Compromise: use the most different of the highly confident instances.
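The self-training loop can be sketched in a few lines. This is a toy illustration of the general recipe, not a specific system: the classifier (per-class mean of 1-D points), the distance-based confidence, and the threshold are all my own stand-ins.

```python
# Self-training: train, label the pool, keep only confident predictions.

def train(labeled):
    """Toy classifier: per-class mean of 1-D points."""
    means = {}
    for label in {y for _, y in labeled}:
        pts = [x for x, y in labeled if y == label]
        means[label] = sum(pts) / len(pts)
    return means

def predict(means, x):
    """Return (label, confidence); confidence shrinks with distance."""
    label = min(means, key=lambda c: abs(x - means[c]))
    return label, 1.0 / (1.0 + abs(x - means[label]))

def self_train(labeled, unlabeled, threshold=0.5, rounds=5):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        scored = [(x, *predict(model, x)) for x in pool]
        keep = [(x, y) for x, y, conf in scored if conf >= threshold]
        if not keep:                       # nothing confident: stop early
            break
        kept = {x for x, _ in keep}
        labeled += keep                    # grow the labeled set
        pool = [x for x in pool if x not in kept]
    return train(labeled)

model = self_train([(0.0, "a"), (10.0, "b")], [1.0, 2.0, 8.5, 9.0])
print(predict(model, 1.5)[0])   # -> "a"
```

Note how the confidence threshold implements the 'best instances' choice discussed above; lowering it trades accuracy of the added labels for coverage.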
Co-Training
Blum & Mitchell, 1998
Basic intuition: "Two heads are better than one."
Ensemble classifier:
- Uses results from multiple classifiers
Multi-view classifier:
- Uses different views of the data (feature subsets)
- Ideally, the views should be:
  - Conditionally independent
  - Individually sufficient: enough information to learn from either view alone
Co-training Set-up
Create two views of the data:
- Typically partition the feature set by type
- E.g., predicting speech emphasis:
  - View 1: acoustics: loudness, pitch, duration
  - View 2: lexicon, syntax, context
Some approaches use learners of different types instead.
In practice, the views may not truly be conditionally independent, but co-training often works pretty well anyway.
Co-training Approach
- Create a small labeled training data set
- Train two (supervised) classifiers on the current training data, using different views
- Use the two classifiers to label the residual unlabeled instances
- Select the 'best' newly labeled data to add to the training data: instances labeled by C1 are added to the training data for C2, and vice versa
- Iterate
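The steps above can be sketched as a loop over two views. This is a hedged toy version (the centroid classifier, confidence measure, and synthetic 2-view data are mine, not Blum & Mitchell's code); the key point is that each classifier's confident labels feed the OTHER classifier's training data.

```python
# Co-training over two views of 1-D data: each instance is (view1, view2).

def fit_view(labeled):
    """Toy per-view classifier: per-class mean of view values."""
    means = {}
    for y in {y for _, y in labeled}:
        pts = [v for v, lab in labeled if lab == y]
        means[y] = sum(pts) / len(pts)
    return means

def predict_view(means, v):
    y = min(means, key=lambda c: abs(v - means[c]))
    return y, 1.0 / (1.0 + abs(v - means[y]))

def co_train(labeled, unlabeled, threshold=0.5, rounds=3):
    train1 = [(v1, y) for (v1, _), y in labeled]   # C1's training data
    train2 = [(v2, y) for (_, v2), y in labeled]   # C2's training data
    pool = list(unlabeled)
    for _ in range(rounds):
        c1, c2 = fit_view(train1), fit_view(train2)
        remaining = []
        for v1, v2 in pool:
            y1, conf1 = predict_view(c1, v1)
            y2, conf2 = predict_view(c2, v2)
            if conf1 >= threshold:
                train2.append((v2, y1))            # C1 labels for C2
            if conf2 >= threshold:
                train1.append((v1, y2))            # C2 labels for C1
            if conf1 < threshold and conf2 < threshold:
                remaining.append((v1, v2))
        pool = remaining
    return fit_view(train1), fit_view(train2)

c1, c2 = co_train([((0.0, 0.0), "a"), ((10.0, 10.0), "b")],
                  [(1.0, 4.0), (9.0, 6.0)])
print(predict_view(c1, 1.0)[0])   # -> "a"
```

The cross-feeding is what makes this co-training rather than two independent self-trainers: view 2 here is too ambiguous to label its own pool confidently, but view 1's confident labels still improve C2.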
Graphically
(Figure from Jeon & Liu, 2011.)
More Devilish Details
Questions for co-training:
- Which instances are 'best' to add to training? The most confident? The most different? Random? Many approaches combine these.
- How many instances to add per iteration? A threshold, by count or by value?
- How long to iterate? A fixed count? A threshold on classifier confidence? etc.
Co-training Applications
Applied to many language-related tasks:
- Blum & Mitchell's paper: academic home-page classification; 95% accuracy with 12 pages labeled and 788 classified
- Sentiment analysis
- Statistical parsing
- Prominence recognition
- Dialog classification
Learning Curves: Semi-supervised vs. Supervised
(Figure: accuracy, 66-84%, vs. number of labeled examples, 9 to 300, for supervised and semi-supervised learning.)
Semi-supervised Learning
An umbrella term for machine learning techniques that:
- Use a small amount of labeled training data
- Augment it with information from unlabeled data
Can be very effective:
- Training on ~10 labeled samples can yield results comparable to training on 1000s
Can be temperamental:
- Sensitive to the data, learning algorithm, and design choices
- Hard to predict the effects of the amount of labeled data, unlabeled data, etc.
Summary
Review
Introduction:
- Entropy, cross-entropy, and mutual information
Classic machine learning algorithms:
- Decision trees, kNN, Naïve Bayes
Discriminative machine learning algorithms:
- MaxEnt, CRFs, SVMs
Other models:
- TBL, EM, semi-supervised approaches
General Methods
Data organization:
- Training, development, and test data splits
Cross-validation:
- Parameter tuning, evaluation
Feature selection:
- Wrapper methods, filtering, weighting
Beam search
Tools, Data, & Tasks
Tools:
- Mallet
- libSVM
Data:
- 20 Newsgroups (text classification)
- Penn Treebank (POS tagging)
Beyond 572
Ling 573:
- 'Capstone' project class: integrates material from the 57* classes
- More 'real-world': project teams, deliverables, repositories
Ling 575s:
- Speech technology: Michael Tjalve (Th, 4pm)
- NLP on mobile devices: Scott Farrar (T, 4pm)
Ling and other electives
Course Evaluations
https://depts.washington.edu/oeaias/webq/survey.cgi?user=UWDL&survey=1397
Thank you!