Bootstrapping Information Extraction with Unlabeled Data
Rayid Ghani Accenture Technology Labs
Rosie Jones, Carnegie Mellon University & Overture
(With contributions from Tom Mitchell and Ellen Riloff)
What is Information Extraction?
Analyze unrestricted text in order to extract pre-specified types of events, entities or relationships
Recent Commercial Applications:
- Database of job postings extracted from corporate web pages (flipdog.com)
- Extracting specific fields from resumes to populate HR databases (mohomine.com)
- Information integration (fetch.com)
- Shopping portals
IE Approaches
Hand-Constructed Rules
Supervised Learning
- Still costly to train and port to new domains
- 3-6 months to port to a new domain (Cardie 98)
- 20,000 words to learn named entity extraction (Seymore et al 99)
- 7,000 labeled examples to learn MUC extraction rules (Soderland 99)
Semi-Supervised Learning
Semi-Supervised Approaches
Several algorithms proposed for different tasks (semantic tagging, text categorization) and tested on different corpora:
- Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.
Goal: systematically analyze and test
- the assumptions underlying the algorithms
- the effectiveness of the algorithms on a common set of problems and corpus
Tasks
Extract noun phrases belonging to the following semantic classes: Locations, Organizations, People
Aren’t you missing the obvious?
Acquire lists of proper nouns:
- Locations: countries, states, cities
- Organizations: online databases
- People: names
Named Entity Extraction? But not all instances are proper nouns: "by the river", "customer", "client"
Use context to disambiguate
A lot of NPs are unambiguous: "the corporation"
A lot of contexts are also unambiguous: subsidiary of <NP>
But as always, there are exceptions... and a LOT of them in this case: customer, John Hancock, Washington
Bootstrapping Approaches
Utilize redundancy in text:
- Noun phrases: New York, China, place we met last time
- Contexts: located in <X>, traveled to <X>
Learn two models:
- Use NPs to label contexts
- Use contexts to label NPs
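One round of this mutual labeling can be sketched as follows (the pair representation and function name are illustrative, not from the talk's implementation):

```python
def bootstrap_round(pairs, labeled_nps):
    """pairs: list of (NP, context) co-occurrences;
    labeled_nps: NPs currently labeled with the target class."""
    # Use NPs to label contexts: a context inherits the label
    # if it co-occurs with a labeled NP.
    labeled_contexts = {ctx for np, ctx in pairs if np in labeled_nps}
    # Use contexts to label NPs: any NP seen in a labeled context.
    new_nps = {np for np, ctx in pairs if ctx in labeled_contexts}
    return labeled_contexts, new_nps

pairs = [("australia", "travelled to <X>"),
         ("france", "travelled to <X>"),
         ("the dog", "<X> ran away")]
ctxs, nps = bootstrap_round(pairs, {"australia"})
# "travelled to <X>" is labeled via "australia", which in turn labels "france"
```

The redundancy is what makes this work: each occurrence carries two views (the NP and its context), so labels can flow from one view to the other.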
Interesting Dimensions for Bootstrapping Algorithms
Incremental vs. Iterative Symmetric vs. Asymmetric Probabilistic vs. Heuristic
Algorithms for Bootstrapping
- Meta-Bootstrapping (Riloff & Jones, 1999): Incremental, Asymmetric, Heuristic
- Co-Training (Blum & Mitchell, 1999): Incremental, Symmetric, Probabilistic(?)
- Co-EM (Nigam & Ghani, 2000): Iterative, Symmetric, Probabilistic
Baselines
- Seed-labeling: label all NPs that match the seeds
- Head-labeling: label all NPs whose head matches the seeds
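The two baselines can be sketched like this (the seed list and helper names are illustrative, and head extraction is simplified to the last word of the NP):

```python
# Illustrative seeds for the "location" class.
SEEDS = {"australia", "canada", "united states"}

def seed_label(np):
    # Seed-labeling: the whole NP must match a seed.
    return np.lower() in SEEDS

def head_label(np):
    # Head-labeling: only the head noun must match a seed
    # (approximated here as the last word of the NP).
    return np.lower().split()[-1] in SEEDS

seed_label("australia")          # True
head_label("western australia")  # True: head "australia" matches
seed_label("western australia")  # False: the full NP is not a seed
```

Head-labeling covers more NPs than seed-labeling, at the cost of some precision.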
Data Set
~4200 corporate web pages (WebKB project at CMU)
Test data marked up manually by labeling every NP as one or more of the following semantic categories: location, organization, person, none
Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)
Seeds
Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan
People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director
Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter, ravonier
Intuition Behind Bootstrapping
[Diagram: noun phrases (the dog, australia, france, the canary islands) on the left, linked by co-occurrence to contexts (<X> ran away, travelled to <X>, <X> is beautiful) on the right]
Co-Training (Blum & Mitchell, 99)
Incremental, Symmetric, Probabilistic
1. Initialize with positive and negative NP seeds
2. Use NPs to label all contexts
3. Add n top-scoring contexts for both positive and negative class
4. Use new contexts to label all NPs
5. Add n top-scoring NPs for both positive and negative class
6. Loop
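A minimal sketch of steps 1-6, with co-occurrence counts standing in for the real NP and context classifiers:

```python
def co_train(pairs, pos_nps, neg_nps, n=1, rounds=2):
    """pairs: (NP, context) co-occurrences; returns the grown NP sets."""
    pos_nps, neg_nps = set(pos_nps), set(neg_nps)   # step 1: seeds
    pos_ctx, neg_ctx = set(), set()

    def top_contexts(nps, known):
        # Steps 2-3: score contexts by labeled-NP co-occurrence counts.
        scores = {}
        for np_, ctx in pairs:
            if np_ in nps and ctx not in known:
                scores[ctx] = scores.get(ctx, 0) + 1
        return set(sorted(scores, key=scores.get, reverse=True)[:n])

    def top_nps(ctxs, known):
        # Steps 4-5: score NPs by labeled-context co-occurrence counts.
        scores = {}
        for np_, ctx in pairs:
            if ctx in ctxs and np_ not in known:
                scores[np_] = scores.get(np_, 0) + 1
        return set(sorted(scores, key=scores.get, reverse=True)[:n])

    for _ in range(rounds):  # step 6: loop
        pos_ctx |= top_contexts(pos_nps, pos_ctx)
        neg_ctx |= top_contexts(neg_nps, neg_ctx)
        pos_nps |= top_nps(pos_ctx, pos_nps)
        neg_nps |= top_nps(neg_ctx, neg_nps)
    return pos_nps, neg_nps

pairs = [("australia", "travelled to <X>"),
         ("france", "travelled to <X>"),
         ("the dog", "<X> ran away"),
         ("the cat", "<X> ran away")]
pos, neg = co_train(pairs, {"australia"}, {"the dog"})
```

The key property is the hard, incremental commitment: only the n best items per class are promoted each round, and once labeled they stay labeled.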
Co-EM(Nigam & Ghani, 2000)
Iterative, Symmetric, Probabilistic
Similar to Co-Training, but probabilistically labels and adds all NPs and contexts to the labeled set
P(class | context_i) = Σ_j P(class | NP_j) P(NP_j | context_i)

P(class | NP_i) = Σ_j P(class | context_j) P(context_j | NP_i)
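In the first update, each context's class probability is the co-occurrence-weighted sum of its NPs' class probabilities; the second update does the same with the roles swapped. A single-class sketch of the first step, estimating P(NP_j | context_i) as a uniform co-occurrence fraction (the data and function name are illustrative):

```python
def label_contexts(pairs, p_np):
    """P(class|context_i) = sum_j P(class|NP_j) * P(NP_j|context_i),
    with P(NP_j|context_i) taken as a uniform co-occurrence fraction."""
    p_ctx = {}
    for ctx in {c for _, c in pairs}:
        nps = [np for np, c in pairs if c == ctx]
        p_ctx[ctx] = sum(p_np[np] for np in nps) / len(nps)
    return p_ctx

p_np = {"australia": 1.0, "france": 0.5, "the dog": 0.0}
pairs = [("australia", "travelled to <X>"),
         ("france", "travelled to <X>"),
         ("the dog", "<X> ran away")]
p_ctx = label_contexts(pairs, p_np)
# "travelled to <X>": (1.0 + 0.5) / 2 = 0.75
```

Because every NP and context keeps a soft probability rather than a hard label, ambiguous items never have to be forced into a single class.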
Meta-Bootstrapping (Riloff & Jones, 99)
Incremental, Asymmetric, Heuristic
Two-level process:
- NPs are used to score contexts according to co-occurrence frequency and diversity
- After the first level, all contexts are discarded and only the best NPs are retained
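The "frequency and diversity" scoring can be sketched with an RlogF-style score in the spirit of Riloff & Jones's related work (the exact formula is an assumption here, not stated on the slide): F is the number of distinct labeled NPs a context extracts, N the total distinct NPs it extracts.

```python
import math

def score_context(extracted_nps, labeled_nps):
    """RlogF-style score: (F/N) * log2(F), rewarding contexts that
    extract many distinct category members with high precision."""
    members = {np for np in extracted_nps if np in labeled_nps}
    f, n = len(members), len(set(extracted_nps))
    if f == 0:
        return 0.0
    return (f / n) * math.log2(f)

# A context extracting several distinct known locations scores high;
# one extracting no known members scores zero.
locations = {"australia", "france", "china"}
score_context(["australia", "france", "china"], locations)
score_context(["the dog"], locations)  # 0.0
```

Diversity enters through F (distinct members) and frequency through the log term, so a context seen with many different category members beats one seen repeatedly with a single member.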
Common Assumptions
Seeds
- Seed density in the corpus
- Head-labeling accuracy
- Syntactic-semantic agreement
Redundancy
- Feature sets are redundant and sufficient
- Labeling disagreement
Feature Set Ambiguity
Feature sets: NPs and contexts
If the feature sets were redundantly sufficient, either of them alone would be enough to correctly classify an instance
Calculate the ambiguity for each feature set: Washington, went to <X>, visit <X>
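The ambiguity calculation can be sketched as the fraction of features whose human labels span more than one class (the data below is illustrative, not the study's corpus):

```python
def ambiguity(feature_labels):
    """feature_labels maps each feature (an NP or a context)
    to the set of semantic classes human labelers assigned it."""
    ambiguous = sum(1 for classes in feature_labels.values()
                    if len(classes) > 1)
    return ambiguous / len(feature_labels)

labels = {"washington": {"location", "person", "organization"},
          "australia": {"location"},
          "went to <X>": {"location"},
          "visit <X>": {"location", "organization"}}
ambiguity(labels)  # 2 of 4 features carry multiple classes -> 0.5
```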
NP Ambiguity (2% of NPs are ambiguous)

Ambiguity Type  Class(es)                                       Number of NPs
1               None, Location, Organization, Person            3574114451189
2               Location+None, Organization+None, Person+None,  63125613
                Loc+Org, Org+Person
3               Loc+Org+None, Org+Person+None                   13
Context Ambiguity (36% of contexts are ambiguous)

Ambiguity Type  Class(es)                                       Number of Contexts
1               None, Location, Organization, Person            1068259859
2               Location+None, Organization+None, Person+None,  51271206550
                Loc+Org, Org+Person
3               Loc+Org+None, Org+Person+None                   1883
4               Loc, Org, Person, None                          6
Labeling Disagreement
Agreement among human labelers: same set of instances, but different levels of information:
- NP only
- Context only
- NP and context
- NP, context, and the entire sentence from the corpus
90.5% agreement when NP, context, and sentence are given
88.5% when the sentence is not given
Results Comparing Bootstrapping Algorithms
Meta-Bootstrapping, Co-Training, and Co-EM on Locations, Organizations, and Person
[Charts: one per class, each comparing Co-EM, MetaBoot, and Co-Training]
More Results
Bootstrapping outperforms both baselines
Improvement is less pronounced for the "people" class
Ambiguous classes don't benefit as much from bootstrapping?
Why does co-EM work well?
Co-EM outperforms Meta-Bootstrapping and Co-Training
Co-EM is probabilistic and does not make hard classifications
Its soft labels reflect the ambiguity among classes
Summary
Starting with 10 seed words, extract NPs belonging to specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM
Probabilistic Bootstrapping with redundant feature sets is effective – even for ambiguous classes
Co-EM performs robustly even when the underlying assumptions are violated
Ongoing Work
- Varying initial seed size and type
- Collecting the training corpus automatically (from the Web)
- Incorporating the user in the loop (Active Learning)