Classifying with limited training data: Active and semi-supervised learning
Sunita Sarawagi
http://www.it.iitb.ac.in/~sunita
Sarawagi 2
Motivation
- Several learning methods are critically dependent on the quality of labeled training data
- Labeled data is often expensive to collect, while unlabeled data is abundant
- Two techniques to reduce labeling effort:
  - Active learning: iteratively select small sets of unlabeled data to be labeled by a human
  - Semi-supervised learning: use unlabeled data to train the classifier
Sarawagi 3
Outline
- Active learning: definition, applications, algorithms
- Case studies: duplicate elimination, information extraction
- Semi-supervised learning: definition, some methods
Sarawagi 4
Application areas
- Text classification
- Duplicate elimination
- Information extraction: HTML wrappers, free text
- Speech recognition: reducing the need for transcribed data
- Semantic parsing of natural language: reducing the need for complex annotated data
Sarawagi 5
Example: active learning

Assume points from two classes (red and green) on a real line, perfectly separable by a single-point separator.

[Figure: labeled and unlabeled points on a line, with "sure red" and "sure green" regions flanking a region of uncertainty]

Goal: the greatest expected reduction in the size of the uncertainty region.
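In this one-dimensional setting the strategy above reduces to binary search: querying the midpoint of the uncertainty region halves it with every label. A minimal sketch (the oracle simulating the human labeler and the separator at 0.37 are made up for illustration):

```python
# Sketch of the 1-D example: points on a line, two classes separated by an
# unknown threshold. The oracle stands in for the human labeler.

def active_learn_threshold(points, oracle, tol=1e-3):
    """Binary-search the separator by always querying the midpoint
    of the current region of uncertainty."""
    lo, hi = min(points), max(points)   # everything is uncertain at first
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if oracle(mid) == "green":      # separator must lie at or left of mid
            hi = mid
        else:                           # separator lies to the right of mid
            lo = mid
    return (lo + hi) / 2.0, queries

# Hidden ground truth: red below 0.37, green at or above.
oracle = lambda x: "green" if x >= 0.37 else "red"
sep, n = active_learn_threshold([0.0, 1.0], oracle)
print(round(sep, 2), n)   # prints: 0.37 10
```

Ten queries suffice where labeling every point would take hundreds, which is the intuition behind the "greatest expected reduction of the uncertainty region" criterion.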
Sarawagi 6
Active learning

Explicit measure:
- For each unlabeled instance and each class label: add it to the training data, train the classifier, and measure classifier confusion
- Compute the expected confusion
- Choose the instance that yields the lowest expected confusion

Implicit measure:
- Train the classifier
- For each unlabeled instance, measure prediction uncertainty
- Choose the instance with the highest uncertainty
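The implicit measure can be sketched in a few lines: score every unlabeled instance by the entropy of the current classifier's class probabilities and query the most uncertain one. The logistic `predict_proba` below is a made-up stand-in for any trained probabilistic classifier:

```python
# Implicit-measure sketch: pick the unlabeled instance whose predicted
# class distribution has the highest entropy.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(unlabeled, predict_proba):
    return max(unlabeled, key=lambda x: entropy(predict_proba(x)))

# Toy classifier: P(green | x) rises smoothly with x (logistic around 0.5).
def predict_proba(x):
    p = 1.0 / (1.0 + math.exp(-10 * (x - 0.5)))
    return [1 - p, p]

pool = [0.1, 0.3, 0.48, 0.7, 0.9]
print(most_uncertain(pool, predict_proba))  # prints: 0.48 (closest to boundary)
```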
Sarawagi 7
Measuring prediction certainty

Classifier-specific methods:
- Support vector machines: distance from the separator
- Naïve Bayes classifier: posterior probability of the winning class
- Decision tree classifier: weighted sum of distances from the different boundaries, error of the leaf, depth of the leaf, etc.

Committee-based approach (Seung, Opper, and Sompolinsky 1992):
- Disagreement amongst members of a committee
- The most successfully used method
Sarawagi 8
Forming a classifier committee

Randomly perturb the learnt parameters:
- Probabilistic classifiers: sample from the posterior distribution on the parameters given the training data. Example: a binomial parameter p has a Beta posterior whose mean is the estimate of p
- Discriminative classifiers: choose a random boundary in the uncertainty region
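For the binomial example above, a sketch of how committee members would each draw their own parameter value (the counts 30/100 are illustrative; with a uniform prior, k successes in n trials give a Beta(k+1, n-k+1) posterior):

```python
# Perturbing a probabilistic classifier's parameter: each committee member
# gets its own draw from the Beta posterior instead of the fixed
# maximum-likelihood estimate k/n.
import random

rng = random.Random(3)
k, n = 30, 100                       # e.g. a feature seen in 30 of 100 documents
committee_p = [rng.betavariate(k + 1, n - k + 1) for _ in range(5)]
print([round(p, 2) for p in committee_p])   # values scattered around 0.3
```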
Sarawagi 9
Committee-based algorithm
- Train k classifiers C1, C2, ..., Ck on the training data
- For each unlabeled instance x:
  - Find the predictions y1, ..., yk of the k classifiers
  - Compute the uncertainty U(x) as the entropy of these y's
- Pick the instance with the highest uncertainty
- Sampling for representativeness: using U(x) as the weight, do weighted sampling to select an instance for labeling
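The steps above can be sketched directly: the committee votes on each instance, U(x) is the entropy of the vote distribution, and the labeled instance is drawn with probability proportional to U(x). The threshold classifiers below are made-up committee members:

```python
# Committee-based (query-by-committee) selection with vote entropy and
# weighted sampling for representativeness.
import math
import random
from collections import Counter

def vote_entropy(votes):
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def select_instance(unlabeled, committee, rng=random.Random(0)):
    weights = [vote_entropy([clf(x) for clf in committee]) for x in unlabeled]
    if sum(weights) == 0:                  # committee unanimous everywhere
        return rng.choice(unlabeled)
    return rng.choices(unlabeled, weights=weights, k=1)[0]

# Committee of 3 perturbed threshold classifiers on a line.
committee = [lambda x, t=t: int(x > t) for t in (0.4, 0.5, 0.6)]
pool = [0.1, 0.45, 0.55, 0.9]
print(select_instance(pool, committee))   # 0.45 or 0.55: the disputed points
```

Only the two points inside the committee's disagreement region get nonzero weight, so one of them is always selected.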
Sarawagi 10
Case study: Duplicate elimination

Given a list of semi-structured records, find all records that refer to the same entity.

Example applications:
- Data warehousing: merging name/address lists (entity: person or household)
- Automatic citation databases (Citeseer): references (entity: paper)

Challenges:
- Errors and inconsistencies in large datasets
- Domain-specific
Sarawagi 11
Motivating example: Citations

Our prior: two references are duplicates when author, title, booktitle, and year match.

Author match could be hard:
- L. Breiman, L. Friedman, and P. Stone, (1984).
- Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone.

Conference match could be harder:
- In VLDB-94
- In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

Fields may not be segmented, and word overlap could be misleading.

Duplicates with little overlap, even in the title:
- Johnson Laird, Philip N. (1983). Mental models. Cambridge, Mass.: Harvard University Press.
- P. N. Johnson-Laird. Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness. Cambridge University Press, 1983.

Non-duplicates with lots of word overlap:
- H. Balakrishnan, S. Seshan, and R. H. Katz. Improving Reliable Transport and Handoff Performance in Cellular Wireless Networks. ACM Wireless Networks, 1(4), December 1995.
- H. Balakrishnan, S. Seshan, E. Amir, R. H. Katz. "Improving TCP/IP Performance over Wireless Networks," Proc. 1st ACM Conf. on Mobile Computing and Networking, November 1995.
Sarawagi 13
Experiences with the learning approach
- Too much manual search in preparing training data
- Hard to spot challenging and covering sets of duplicates in large lists
- Even harder to find close non-duplicates that will capture the nuances: examine instances that are similar on one attribute but dissimilar on another
- Active learning is a generalization of this!
Sarawagi 14
Learning to identify duplicates

[Figure: labeled record pairs are mapped by similarity functions f1, f2, ..., fn to feature vectors, each tagged duplicate (1) or non-duplicate (0); a classifier is trained on these vectors; pairs from the unlabeled list are scored the same way, and the active learner selects the uncertain pairs for labeling]
Forming a committee of trees
- Selecting the split attribute
  - Normally: the attribute with the lowest entropy
  - Perturbed: a random attribute within close range of the lowest
- Selecting a split point
  - Normally: the midpoint of the range with the lowest entropy
  - Perturbed: a random point anywhere in the range with the lowest entropy
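A sketch of the attribute perturbation: each committee member picks at random among the attributes whose entropy is within a tolerance of the best. The attribute names and the 10% tolerance are illustrative, not taken from the paper:

```python
# Perturbed split-attribute selection for a committee of decision trees:
# a random attribute within close range of the lowest entropy, instead of
# always the single lowest-entropy attribute.
import random

def perturbed_split_choice(attr_entropy, tol=0.10, rng=random.Random(1)):
    best = min(attr_entropy.values())
    close = [a for a, h in attr_entropy.items() if h <= best * (1 + tol)]
    return rng.choice(close)

entropies = {"title_sim": 0.21, "author_sim": 0.22, "year_sim": 0.55}
# Different committee members may split on title_sim or author_sim,
# but never on the clearly worse year_sim.
print(perturbed_split_choice(entropies))
```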
Sarawagi 16
Experimental analysis
- 250 references from Citeseer, giving 32,000 pairs of which only 150 are duplicates
- Citeseer's script used to segment references into author, title, year, page, and rest
- 20 text and integer similarity functions
- Average of 20 runs
- Default classifier: decision tree
- Initial labeled set: just two pairs
Sarawagi 17
Methods of creating a committee
- Data partition: bad when data is limited
- Attribute partition: bad when data is sufficient
- Parameter perturbation: best overall
Sarawagi 18
Importance of randomization

It is important to randomize selection for generative classifiers like naïve Bayes.

[Plots: selection performance for the decision tree and for naïve Bayes]
Sarawagi 19
Choosing the right classifier
- SVMs: good initially, but not effective in choosing instances
- Decision trees: best overall
Sarawagi 20
Benefits of active learning
- Active learning is much better than random selection
- With only 100 actively selected instances: 97% accuracy, versus only 30% for random selection
- Committee-based selection is close to optimal
Sarawagi 21
Analyzing selected instances
- Fraction of duplicates in the selected instances: 44%, starting from only 0.5% in the data
- Is the gain due to the increased fraction of duplicates?
  - Replaced the non-duplicates in the selected set with random non-duplicates
  - Result: only 40% accuracy!
Sarawagi 22
Case study: Information Extraction (IE)

The IE task: given E, a set of structured elements (the target schema), and S, an unstructured source, extract all instances of E from S.

Varying levels of difficulty, depending on the input and the kind of extracted patterns:
- Text segmentation: extraction by segmenting text
- HTML wrappers: extraction from formatted text
- Classical IE: extraction from free-format text
Sarawagi 23
IE by text segmentation

Source: a concatenation of structured elements with limited reordering and some missing fields. Examples: addresses, bibliographic records.

P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115, 12231-12237.
-> Author | Year | Title | Journal | Volume | Page

36/307 Unnat Nagar (II) Goregaon (W) Bombay 400 079
-> House number | Area | City | Zip
Sarawagi 24
IE with Hidden Markov Models

Probabilistic models for IE.

[Figure: an HMM with states such as Author, Title, Journal, and Year; edges carry transition probabilities (e.g., 0.9, 0.5, 0.8, 0.2, 0.1), and each state carries an emission distribution, e.g., Journal emits "journal" 0.4, "ACM" 0.2, "IEEE" 0.3; Author emits letter 0.3, "et al." 0.1, word 0.5; Year emits dddd 0.8, dd 0.2]
Sarawagi 25
A model for Indian Addresses
Sarawagi 26
Active learning in IE with HMMs
- Form a committee of HMMs by random perturbation
  - Emission and transition probabilities are independent multinomial distributions
  - The posterior distribution for multinomial parameters is a Dirichlet, with mean at the maximum-likelihood estimate
- Results on part-of-speech tagging (Dagan 1999): 92.6% accuracy using active learning with 20,000 instances, as against 100,000 randomly selected instances
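The perturbation above can be sketched as follows: treat each row of the transition matrix as a multinomial whose Dirichlet posterior has parameters counts + prior, and give each committee member its own sampled row (a Dirichlet draw can be built from normalized Gamma draws). The counts and prior are illustrative:

```python
# Forming an HMM committee by sampling each transition row from its
# Dirichlet posterior (observed counts plus a symmetric prior).
import random

def sample_dirichlet(alphas, rng):
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_transition_row(counts, prior=1.0, rng=random.Random(7)):
    return sample_dirichlet([c + prior for c in counts], rng)

# Observed transitions out of the Author state, to: Author, Year, Title.
counts = [45, 40, 5]
committee_rows = [sample_transition_row(counts) for _ in range(3)]
for row in committee_rows:
    print([round(p, 2) for p in row])   # each row sums to 1, varies per member
```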
Sarawagi 27
Semi-supervised learning

Unlabeled data can improve classifier accuracy by providing correlation information between features.

Three methods:
- Probabilistic classifiers like naïve Bayes and HMMs: the Expectation Maximization (EM) method
- Distance-based classifiers like k-nearest neighbor: the graph min-cut method
- Paired independent classifiers: co-training
Sarawagi 28
The EM approach

Dl: labeled data; Du: unlabeled data.
- Train the classifier parameters using Dl
- While the likelihood of Dl + Du improves:
  - E step: for each d in Du, find its fractional membership in each class using the current classifier parameters
  - M step: use the fractional memberships of Du and the labels of Dl to re-estimate the maximum-likelihood parameters of the classifier
- Output the classifier
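A compact sketch of this loop, using a two-class model with one Gaussian feature per class (fixed unit variance) instead of naïve Bayes to keep it short. Dl seeds the parameters; Du contributes fractional counts each round. All data values are made up:

```python
# Semi-supervised EM sketch: labeled points keep hard labels, unlabeled
# points contribute fractional class memberships to the M step.
import math

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2)   # unnormalized, unit variance

def em_semi_supervised(Dl, Du, iters=20):
    # Initialize class means from the labeled data only.
    mu = [sum(x for x, y in Dl if y == c) / sum(1 for _, y in Dl if y == c)
          for c in (0, 1)]
    for _ in range(iters):
        # E step: fractional membership in class 1 for each unlabeled point.
        resp = []
        for x in Du:
            p0, p1 = gauss(x, mu[0]), gauss(x, mu[1])
            resp.append(p1 / (p0 + p1))
        # M step: re-estimate means from Dl's labels plus Du's fractions.
        for c in (0, 1):
            w = [(r if c == 1 else 1 - r) for r in resp]
            num = sum(x for x, y in Dl if y == c) + sum(wi * x for wi, x in zip(w, Du))
            den = sum(1 for _, y in Dl if y == c) + sum(w)
            mu[c] = num / den
    return mu

Dl = [(0.0, 0), (4.0, 1)]                 # one labeled point per class
Du = [0.2, -0.3, 0.1, 3.8, 4.1, 4.3]      # unlabeled, two clear clusters
mu = em_semi_supervised(Dl, Du)
print([round(m, 1) for m in mu])          # means pulled toward the clusters
```

With one labeled point per class, EM recovers means near the two cluster centers, which is the correlation information the unlabeled data supplies.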
Sarawagi 29
Results with EM

Practical consideration: when the unlabeled data is very large and the class labels don't correspond to natural data clusters, the contribution of the unlabeled data to the parameters needs to be weighted down.

- Experiments on text classification with naïve Bayes
  - 20 Newsgroups: the 10,000 labeled documents needed for 70% accuracy reduced to 600 labeled + 20,000 unlabeled
- Experiments on IE with HMMs: no improvement in accuracy
Sarawagi 30
The graph min-cut method
- Construct a weighted graph over Dl + Du, where Dl = Dl+ ∪ Dl-
- Edge weight wij = similarity between instances i and j

[Figure: labeled positive and negative nodes and unlabeled nodes, joined by edges with weights wij]
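The classification rule this implies: label the unlabeled nodes so that the total weight of edges crossing the positive/negative boundary is minimized, which is exactly what an s-t min cut computes at scale. On a tiny made-up graph the same objective can be brute-forced for illustration:

```python
# Min-cut classification sketch: choose the labeling of unlabeled nodes
# that cuts the least total edge weight (brute force on a toy graph;
# real instances use an s-t max-flow/min-cut algorithm).
from itertools import product

def min_cut_labels(pos, neg, unlabeled, edges):
    """edges: dict {(i, j): w_ij}. Returns the labeling minimizing cut weight."""
    best = None
    for assignment in product([1, -1], repeat=len(unlabeled)):
        labels = {**{v: 1 for v in pos}, **{v: -1 for v in neg},
                  **dict(zip(unlabeled, assignment))}
        cut = sum(w for (i, j), w in edges.items() if labels[i] != labels[j])
        if best is None or cut < best[0]:
            best = (cut, labels)
    return best[1]

# p is labeled positive, n negative; a and b are unlabeled.
edges = {("p", "a"): 0.9, ("a", "b"): 0.8, ("b", "n"): 0.1,
         ("p", "b"): 0.2, ("a", "n"): 0.1}
labels = min_cut_labels(pos=["p"], neg=["n"], unlabeled=["a", "b"], edges=edges)
print(labels)   # a and b side with the positive node p
```

Because a and b are strongly tied to p and to each other, cutting only the weak edges to n (total weight 0.2) is cheapest, so both become positive.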
Sarawagi 31
Conclusion
- Active learning: successfully used in several applications to reduce the need for training data
- Semi-supervised learning:
  - Limited improvement observed in text classification with naïve Bayes
  - Most proposed methods are classifier-specific
  - Still open to further research
References
- Shlomo Argamon-Engelson and Ido Dagan. Committee-based sample selection for probabilistic classifiers. Journal of Artificial Intelligence Research, 11:335-360, 1999.
- Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133-168, 1997.
- S. Sarawagi and Anuradha Bhamidipaty. Interactive deduplication using active learning. ACM SIGKDD, 2002.
- H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. Computational Learning Theory, pages 287-294, 1992.
- T. Zhang and F. J. Oles. A probability analysis on the value of unlabeled data for classification problems. ICML, 2000.
- Vinayak R. Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic text segmentation for extracting structured records. SIGMOD, 2001.
- D. Freitag and A. McCallum. Information extraction with HMM structures learned by stochastic optimization. AAAI, 2000.