Classifying with limited training data: Active and semi-supervised learning
Sunita Sarawagi
http://www.it.iitb.ac.in/~sunita
Sarawagi 2
Motivation
- Several learning methods are critically dependent on the quality of labeled training data
- Labeled data is often expensive to collect, while unlabeled data is abundant
- Two techniques to reduce labeling effort:
  - Active learning: iteratively select small sets of unlabeled data to be labeled by a human
  - Semi-supervised learning: use unlabeled data to train the classifier
Sarawagi 3
Outline
- Active learning: definition, applications, algorithms
- Case studies: duplicate elimination, information extraction
- Semi-supervised learning: definition, some methods
Sarawagi 4
Application areas
- Text classification
- Duplicate elimination
- Information extraction: HTML wrappers, free text
- Speech recognition: reducing the need for transcribed data
- Semantic parsing of natural language: reducing the need for complex annotated data
Sarawagi 5
Example: active learning

Assume points from two classes (red and green) on a real line, perfectly separable by a single-point separator.

[Figure: labeled and unlabeled points on a line, with "sure red" and "sure green" regions flanking a region of uncertainty]

Goal: the greatest expected reduction in the size of the uncertainty region.
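In this one-dimensional setting the strategy above reduces to binary search: querying the midpoint of the uncertainty region halves it with every label. A minimal sketch (the oracle simulating the human labeler and the separator at 0.37 are made up for illustration):

```python
# Sketch of the 1-D example: points on a line, two classes separated by an
# unknown threshold. The oracle stands in for the human labeler.

def active_learn_threshold(points, oracle, tol=1e-3):
    """Binary-search the separator by always querying the midpoint
    of the current region of uncertainty."""
    lo, hi = min(points), max(points)   # everything is uncertain at first
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if oracle(mid) == "green":      # separator must lie at or left of mid
            hi = mid
        else:                           # separator lies to the right of mid
            lo = mid
    return (lo + hi) / 2.0, queries

# Hidden ground truth: red below 0.37, green at or above.
oracle = lambda x: "green" if x >= 0.37 else "red"
sep, n = active_learn_threshold([0.0, 1.0], oracle)
print(round(sep, 2), n)   # prints: 0.37 10
```

Ten queries suffice where labeling every point would take hundreds, which is the intuition behind the "greatest expected reduction of the uncertainty region" criterion.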
Sarawagi 6
Active learning

Explicit measure:
- For each unlabeled instance and each class label: add it to the training data, train the classifier, and measure classifier confusion
- Compute the expected confusion
- Choose the instance that yields the lowest expected confusion

Implicit measure:
- Train the classifier
- For each unlabeled instance, measure prediction uncertainty
- Choose the instance with the highest uncertainty
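The implicit measure can be sketched in a few lines: score every unlabeled instance by the entropy of the current classifier's class probabilities and query the most uncertain one. The logistic `predict_proba` below is a made-up stand-in for any trained probabilistic classifier:

```python
# Implicit-measure sketch: pick the unlabeled instance whose predicted
# class distribution has the highest entropy.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(unlabeled, predict_proba):
    return max(unlabeled, key=lambda x: entropy(predict_proba(x)))

# Toy classifier: P(green | x) rises smoothly with x (logistic around 0.5).
def predict_proba(x):
    p = 1.0 / (1.0 + math.exp(-10 * (x - 0.5)))
    return [1 - p, p]

pool = [0.1, 0.3, 0.48, 0.7, 0.9]
print(most_uncertain(pool, predict_proba))  # prints: 0.48 (closest to boundary)
```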
Sarawagi 7
Measuring prediction certainty

Classifier-specific methods:
- Support vector machines: distance from the separator
- Naïve Bayes classifier: posterior probability of the winning class
- Decision tree classifier: weighted sum of distances from the different boundaries, error of the leaf, depth of the leaf, etc.

Committee-based approach (Seung, Opper, and Sompolinsky 1992):
- Disagreement amongst members of a committee
- The most successfully used method
Sarawagi 8
Forming a classifier committee

Randomly perturb the learnt parameters:
- Probabilistic classifiers: sample from the posterior distribution on the parameters given the training data. Example: a binomial parameter p has a Beta posterior whose mean is the estimate of p
- Discriminative classifiers: choose a random boundary in the uncertainty region
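For the binomial example above, a sketch of how committee members would each draw their own parameter value (the counts 30/100 are illustrative; with a uniform prior, k successes in n trials give a Beta(k+1, n-k+1) posterior):

```python
# Perturbing a probabilistic classifier's parameter: each committee member
# gets its own draw from the Beta posterior instead of the fixed
# maximum-likelihood estimate k/n.
import random

rng = random.Random(3)
k, n = 30, 100                       # e.g. a feature seen in 30 of 100 documents
committee_p = [rng.betavariate(k + 1, n - k + 1) for _ in range(5)]
print([round(p, 2) for p in committee_p])   # values scattered around 0.3
```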
Sarawagi 9
Committee-based algorithm
- Train k classifiers C1, C2, ..., Ck on the training data
- For each unlabeled instance x:
  - Find the predictions y1, ..., yk of the k classifiers
  - Compute the uncertainty U(x) as the entropy of these y's
- Pick the instance with the highest uncertainty
- Sampling for representativeness: using U(x) as the weight, do weighted sampling to select an instance for labeling
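The steps above can be sketched directly: the committee votes on each instance, U(x) is the entropy of the vote distribution, and the labeled instance is drawn with probability proportional to U(x). The threshold classifiers below are made-up committee members:

```python
# Committee-based (query-by-committee) selection with vote entropy and
# weighted sampling for representativeness.
import math
import random
from collections import Counter

def vote_entropy(votes):
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def select_instance(unlabeled, committee, rng=random.Random(0)):
    weights = [vote_entropy([clf(x) for clf in committee]) for x in unlabeled]
    if sum(weights) == 0:                  # committee unanimous everywhere
        return rng.choice(unlabeled)
    return rng.choices(unlabeled, weights=weights, k=1)[0]

# Committee of 3 perturbed threshold classifiers on a line.
committee = [lambda x, t=t: int(x > t) for t in (0.4, 0.5, 0.6)]
pool = [0.1, 0.45, 0.55, 0.9]
print(select_instance(pool, committee))   # 0.45 or 0.55: the disputed points
```

Only the two points inside the committee's disagreement region get nonzero weight, so one of them is always selected.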
Sarawagi 10
Case study: Duplicate elimination

Given a list of semi-structured records, find all records that refer to the same entity.

Example applications:
- Data warehousing: merging name/address lists (entity: person or household)
- Automatic citation databases (Citeseer): references (entity: paper)

Challenges:
- Errors and inconsistencies in large datasets
- Domain-specific
Sarawagi 11
Motivating example: Citations

Our prior: two references are duplicates when author, title, booktitle, and year match.

Author match could be hard:
- L. Breiman, L. Friedman, and P. Stone, (1984).
- Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone.

Conference match could be harder:
- In VLDB-94
- In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

Fields may not be segmented, and word overlap could be misleading.

Duplicates with little overlap, even in the title:
- Johnson Laird, Philip N. (1983). Mental models. Cambridge, Mass.: Harvard University Press.
- P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, 1983.

Non-duplicates with lots of word overlap:
- H. Balakrishnan, S. Seshan, and R. H. Katz. Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks. ACM Wireless Networks, 1(4), December 1995.
- H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz. "Improving TCP/IP Performance over Wireless Networks," Proc. 1st ACM Conf. on Mobile Computing and Networking, November 1995.
Sarawagi 13
Experiences with the learning approach
- Too much manual search in preparing training data
- Hard to spot challenging and covering sets of duplicates in large lists
- Even harder to find close non-duplicates that will capture the nuances: examine instances that are similar on one attribute but dissimilar on another
- Active learning is a generalization of this!
Sarawagi 14
Learning to identify duplicates

[Figure: labeled record pairs are mapped by similarity functions f1, f2, ..., fn to feature vectors, each tagged duplicate (1) or non-duplicate (0); a classifier is trained on these vectors; pairs from the unlabeled list are scored the same way, and the active learner selects the uncertain pairs for labeling]
Forming a committee of trees
- Selecting the split attribute
  - Normally: the attribute with the lowest entropy
  - Perturbed: a random attribute within close range of the lowest
- Selecting a split point
  - Normally: the midpoint of the range with the lowest entropy
  - Perturbed: a random point anywhere in the range with the lowest entropy
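A sketch of the attribute perturbation: each committee member picks at random among the attributes whose entropy is within a tolerance of the best. The attribute names and the 10% tolerance are illustrative, not taken from the paper:

```python
# Perturbed split-attribute selection for a committee of decision trees:
# a random attribute within close range of the lowest entropy, instead of
# always the single lowest-entropy attribute.
import random

def perturbed_split_choice(attr_entropy, tol=0.10, rng=random.Random(1)):
    best = min(attr_entropy.values())
    close = [a for a, h in attr_entropy.items() if h <= best * (1 + tol)]
    return rng.choice(close)

entropies = {"title_sim": 0.21, "author_sim": 0.22, "year_sim": 0.55}
# Different committee members may split on title_sim or author_sim,
# but never on the clearly worse year_sim.
print(perturbed_split_choice(entropies))
```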
Sarawagi 16
Experimental analysis
- 250 references from Citeseer, giving 32,000 pairs of which only 150 are duplicates
- Citeseer's script used to segment references into author, title, year, page, and rest
- 20 text and integer similarity functions
- Average of 20 runs
- Default classifier: decision tree
- Initial labeled set: just two pairs
Sarawagi 17
Methods of creating a committee
- Data partition: bad when data is limited
- Attribute partition: bad when data is sufficient
- Parameter perturbation: best overall
Sarawagi 18
Importance of randomization

It is important to randomize selection for generative classifiers like naïve Bayes.

[Plots: selection performance for the decision tree and for naïve Bayes]
Sarawagi 19
Choosing the right classifier
- SVMs: good initially, but not effective in choosing instances
- Decision trees: best overall
Sarawagi 20
Benefits of active learning
- Active learning is much better than random selection
- With only 100 actively selected instances: 97% accuracy, versus only 30% for random selection
- Committee-based selection is close to optimal
Sarawagi 21
Analyzing selected instances
- Fraction of duplicates in the selected instances: 44%, starting from only 0.5% in the data
- Is the gain due to the increased fraction of duplicates?
  - Replaced the non-duplicates in the selected set with random non-duplicates
  - Result: only 40% accuracy!
Sarawagi 22
Case study: Information Extraction (IE)

The IE task: given E, a set of structured elements (the target schema), and S, an unstructured source, extract all instances of E from S.

Varying levels of difficulty, depending on the input and the kind of extracted patterns:
- Text segmentation: extraction by segmenting text
- HTML wrappers: extraction from formatted text
- Classical IE: extraction from free-format text
Sarawagi 23
IE by text segmentation

Source: a concatenation of structured elements with limited reordering and some missing fields. Examples: addresses, bibliographic records.

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
-> Author | Year | Title | Journal | Volume | Page

36/307 Unnat Nagar (II) Goregaon (W) Bombay 400 079
-> House number | Area | City | Zip
Sarawagi 24
IE with Hidden Markov Models

Probabilistic models for IE.

[Figure: an HMM with states such as Author, Title, Journal, and Year; edges carry transition probabilities (e.g., 0.9, 0.5, 0.8, 0.2, 0.1), and each state carries an emission distribution, e.g., Journal emits "journal" 0.4, "ACM" 0.2, "IEEE" 0.3; Author emits letter 0.3, "et al." 0.1, word 0.5; Year emits dddd 0.8, dd 0.2]
Sarawagi 25
A model for Indian Addresses
Sarawagi 26
Active learning in IE with HMMs
- Form a committee of HMMs by random perturbation
  - Emission and transition probabilities are independent multinomial distributions
  - The posterior distribution for multinomial parameters is a Dirichlet, with mean at the maximum-likelihood estimate
- Results on part-of-speech tagging (Dagan 1999): 92.6% accuracy using active learning with 20,000 instances, as against 100,000 randomly selected instances
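The perturbation above can be sketched as follows: treat each row of the transition matrix as a multinomial whose Dirichlet posterior has parameters counts + prior, and give each committee member its own sampled row (a Dirichlet draw can be built from normalized Gamma draws). The counts and prior are illustrative:

```python
# Forming an HMM committee by sampling each transition row from its
# Dirichlet posterior (observed counts plus a symmetric prior).
import random

def sample_dirichlet(alphas, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_transition_row(counts, prior=1.0, rng=random.Random(7)):
    return sample_dirichlet([c + prior for c in counts], rng)

# Observed transitions out of the Author state, to: Author, Year, Title.
counts = [45, 40, 5]
committee_rows = [sample_transition_row(counts) for _ in range(3)]
for row in committee_rows:
    print([round(p, 2) for p in row])   # each row sums to 1, varies per member
```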
Sarawagi 27
Semi-supervised learning

Unlabeled data can improve classifier accuracy by providing correlation information between features.

Three methods:
- Probabilistic classifiers like naïve Bayes and HMMs: the Expectation Maximization (EM) method
- Distance-based classifiers like k-nearest neighbor: the graph min-cut method
- Paired independent classifiers: co-training
Sarawagi 28
The EM approach

Dl: labeled data; Du: unlabeled data.
- Train the classifier parameters using Dl
- While the likelihood of Dl + Du improves:
  - E step: for each d in Du, find its fractional membership in each class using the current classifier parameters
  - M step: use the fractional memberships of Du and the labels of Dl to re-estimate the maximum-likelihood parameters of the classifier
- Output the classifier
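A compact sketch of this loop, using a two-class model with one Gaussian feature per class (fixed unit variance) instead of naïve Bayes to keep it short. Dl seeds the parameters; Du contributes fractional counts each round. All data values are made up:

```python
# Semi-supervised EM sketch: labeled points keep hard labels, unlabeled
# points contribute fractional class memberships to the M step.
import math

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2)   # unnormalized, unit variance

def em_semi_supervised(Dl, Du, iters=20):
    # Initialize class means from the labeled data only.
    mu = [sum(x for x, y in Dl if y == c) / sum(1 for _, y in Dl if y == c)
          for c in (0, 1)]
    for _ in range(iters):
        # E step: fractional membership in class 1 for each unlabeled point.
        resp = []
        for x in Du:
            p0, p1 = gauss(x, mu[0]), gauss(x, mu[1])
            resp.append(p1 / (p0 + p1))
        # M step: re-estimate means from Dl's labels plus Du's fractions.
        for c in (0, 1):
            w = [(r if c == 1 else 1 - r) for r in resp]
            num = sum(x for x, y in Dl if y == c) + sum(wi * x for wi, x in zip(w, Du))
            den = sum(1 for _, y in Dl if y == c) + sum(w)
            mu[c] = num / den
    return mu

Dl = [(0.0, 0), (4.0, 1)]                 # one labeled point per class
Du = [0.2, -0.3, 0.1, 3.8, 4.1, 4.3]      # unlabeled, two clear clusters
mu = em_semi_supervised(Dl, Du)
print([round(m, 1) for m in mu])          # means pulled toward the clusters
```

With one labeled point per class, EM recovers means near the two cluster centers, which is the correlation information the unlabeled data supplies.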
Sarawagi 29
Results with EM

Practical consideration: when the unlabeled data is very large and the class labels don't correspond to natural data clusters, the contribution of the unlabeled data to the parameters needs to be weighted down.

- Experiments on text classification with naïve Bayes
  - 20 Newsgroups: the 10,000 labeled documents needed for 70% accuracy reduced to 600 labeled + 20,000 unlabeled
- Experiments on IE with HMMs: no improvement in accuracy
Sarawagi 30
The graph min-cut method
- Construct a weighted graph over Dl + Du, where Dl = Dl+ ∪ Dl-
- Edge weight wij = similarity between instances i and j

[Figure: labeled positive and negative nodes and unlabeled nodes, joined by edges with weights wij]
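The classification rule this implies: label the unlabeled nodes so that the total weight of edges crossing the positive/negative boundary is minimized, which is exactly what an s-t min cut computes at scale. On a tiny made-up graph the same objective can be brute-forced for illustration:

```python
# Min-cut classification sketch: choose the labeling of unlabeled nodes
# that cuts the least total edge weight (brute force on a toy graph;
# real instances use an s-t max-flow/min-cut algorithm).
from itertools import product

def min_cut_labels(pos, neg, unlabeled, edges):
    """edges: dict {(i, j): w_ij}. Returns the labeling minimizing cut weight."""
    best = None
    for assignment in product([1, -1], repeat=len(unlabeled)):
        labels = {**{v: 1 for v in pos}, **{v: -1 for v in neg},
                  **dict(zip(unlabeled, assignment))}
        cut = sum(w for (i, j), w in edges.items() if labels[i] != labels[j])
        if best is None or cut < best[0]:
            best = (cut, labels)
    return best[1]

# p is labeled positive, n negative; a and b are unlabeled.
edges = {("p", "a"): 0.9, ("a", "b"): 0.8, ("b", "n"): 0.1,
         ("p", "b"): 0.2, ("a", "n"): 0.1}
labels = min_cut_labels(pos=["p"], neg=["n"], unlabeled=["a", "b"], edges=edges)
print(labels)   # a and b side with the positive node p
```

Because a and b are strongly tied to p and to each other, cutting only the weak edges to n (total weight 0.2) is cheapest, so both become positive.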
Sarawagi 31
Conclusion
- Active learning: successfully used in several applications to reduce the need for training data
- Semi-supervised learning:
  - Limited improvement observed in text classification with naïve Bayes
  - Most proposed methods are classifier-specific
  - Still open to further research
References
- Shlomo Argamon-Engelson and Ido Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335-360, 1999.
- Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
- S. Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. ACM SIGKDD, 2002.
- H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. Computational Learning Theory, pages 287-294, 1992.
- T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. ICML, 2000.
- Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD, 2001.
- D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI, 2000.