
Page 1: Bootstrapping Information Extraction with Unlabeled Data

Bootstrapping Information Extraction with Unlabeled Data

Rayid Ghani Accenture Technology Labs

Rosie JonesCarnegie Mellon University & Overture

(With contributions from Tom Mitchell and Ellen Riloff)

Page 2: Bootstrapping Information Extraction with Unlabeled Data

What is Information Extraction?

Analyze unrestricted text in order to extract pre-specified types of events, entities or relationships

Recent Commercial Applications:
- Database of job postings extracted from corporate web pages (flipdog.com)
- Extracting specific fields from resumes to populate HR databases (mohomine.com)
- Information integration (fetch.com)
- Shopping portals

Page 3: Bootstrapping Information Extraction with Unlabeled Data

IE Approaches

Hand-Constructed Rules

Supervised Learning
- Still costly to train and port to new domains
- 3-6 months to port to a new domain (Cardie 98)
- 20,000 words to learn named entity extraction (Seymore et al 99)
- 7,000 labeled examples to learn MUC extraction rules (Soderland 99)

Semi-Supervised Learning

Page 4: Bootstrapping Information Extraction with Unlabeled Data

Semi-Supervised Approaches

Several algorithms proposed for different tasks (semantic tagging, text categorization) and tested on different corpora: Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.

Goal: systematically analyze and test
- the assumptions underlying the algorithms
- the effectiveness of the algorithms on a common set of problems and corpus

Page 5: Bootstrapping Information Extraction with Unlabeled Data

Tasks

Extract noun phrases belonging to the following semantic classes:
- Locations
- Organizations
- People

Page 6: Bootstrapping Information Extraction with Unlabeled Data

Aren’t you missing the obvious?

Acquire lists of proper nouns:
- Locations: countries, states, cities
- Organizations: online databases
- People: names

Named Entity Extraction? But not all instances are proper nouns:
"by the river", "customer", "client"

Page 7: Bootstrapping Information Extraction with Unlabeled Data

Use context to disambiguate

A lot of NPs are unambiguous: "the corporation"

A lot of contexts are also unambiguous: subsidiary of <NP>

But as always, there are exceptions... and a LOT of them in this case: customer, John Hancock, Washington

Page 8: Bootstrapping Information Extraction with Unlabeled Data

Bootstrapping Approaches

Utilize redundancy in text:
- Noun phrases: New York, China, place we met last time
- Contexts: located in <X>, traveled to <X>

Learn two models:
- Use NPs to label contexts
- Use contexts to label NPs
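This two-view redundancy can be sketched in a few lines of Python. The mini-corpus, the seed set, and the single "location" class below are invented for illustration; a real system would score and rank candidates rather than accept them all.

```python
# Hypothetical (noun phrase, context) co-occurrence pairs.
pairs = [
    ("new york", "located in <X>"),
    ("new york", "traveled to <X>"),
    ("china", "located in <X>"),
    ("china", "traveled to <X>"),
    ("the dog", "<X> ran away"),
]

seeds = {"china"}  # NPs assumed known to be locations

# Use seed NPs to label the contexts they occur in...
location_contexts = {ctx for np, ctx in pairs if np in seeds}

# ...then use the newly labeled contexts to label more NPs.
location_nps = {np for np, ctx in pairs if ctx in location_contexts}
```

One round already pulls in "new york" via the contexts it shares with the seed, while "<X> ran away" and "the dog" stay unlabeled.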

Page 9: Bootstrapping Information Extraction with Unlabeled Data

Interesting Dimensions for Bootstrapping Algorithms

- Incremental vs. Iterative
- Symmetric vs. Asymmetric
- Probabilistic vs. Heuristic

Page 10: Bootstrapping Information Extraction with Unlabeled Data

Algorithms for Bootstrapping

Meta-Bootstrapping (Riloff & Jones, 1999): Incremental, Asymmetric, Heuristic

Co-Training (Blum & Mitchell, 1999): Incremental, Symmetric, Probabilistic(?)

Co-EM (Nigam & Ghani, 2000): Iterative, Symmetric, Probabilistic

Baselines:
- Seed-Labeling: label all NPs that match the seeds
- Head-Labeling: label all NPs whose head matches the seeds
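The two baselines can be made concrete with a small sketch. The seed list and NPs are invented, and the head of an NP is approximated here as its last token:

```python
seeds = {"canada", "company"}

def head(np: str) -> str:
    # Simplifying assumption: the head noun is the last token.
    return np.split()[-1]

nps = ["canada", "the acme company", "a canada goose", "users"]

# Seed-Labeling: exact match against the seed list.
seed_labeled = [np for np in nps if np in seeds]
# Head-Labeling: match the NP's head noun against the seed list.
head_labeled = [np for np in nps if head(np) in seeds]
```

Head-labeling generalizes further: "the acme company" matches via its head "company", which seed-labeling misses.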

Page 11: Bootstrapping Information Extraction with Unlabeled Data

Data Set

~4200 corporate web pages (WebKB project at CMU)

Test data marked up manually by labeling every NP as one or more of the following semantic categories: location, organization, person, none

Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)

Page 12: Bootstrapping Information Extraction with Unlabeled Data

Seeds

Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan

People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director

Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter marine group, rayonier

Page 13: Bootstrapping Information Extraction with Unlabeled Data

Intuition Behind Bootstrapping

Noun Phrases: the dog, australia, france, the canary islands

Contexts: <X> ran away, travelled to <X>, <X> is beautiful

Page 14: Bootstrapping Information Extraction with Unlabeled Data

Co-Training(Blum & Mitchell, 99)

Incremental, Symmetric, Probabilistic

1. Initialize with positive and negative NP seeds
2. Use NPs to label all contexts
3. Add n top-scoring contexts for both the positive and negative class
4. Use new contexts to label all NPs
5. Add n top-scoring NPs for both the positive and negative class
6. Loop
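A minimal sketch of this loop, assuming an invented toy corpus, a single positive class, and raw co-occurrence counts as the scoring function (the original algorithm trains a classifier over each view):

```python
from collections import Counter

# Invented (noun phrase, context) co-occurrences.
pairs = [
    ("germany", "located in <X>"), ("germany", "flew to <X>"),
    ("france", "located in <X>"), ("the lake", "swam in <X>"),
    ("the lake", "located in <X>"), ("the dog", "<X> barked"),
]

labeled_nps = {"germany"}   # step 1: seed NPs
labeled_ctxs = set()
n = 1                       # items promoted per round

for _ in range(2):          # two bootstrapping rounds
    # Steps 2-3: score unlabeled contexts by co-occurrence with
    # labeled NPs, then promote the n best.
    ctx_scores = Counter(c for np, c in pairs
                         if np in labeled_nps and c not in labeled_ctxs)
    labeled_ctxs.update(c for c, _ in ctx_scores.most_common(n))
    # Steps 4-5: score unlabeled NPs by labeled contexts, promote n best.
    np_scores = Counter(np for np, c in pairs
                        if c in labeled_ctxs and np not in labeled_nps)
    labeled_nps.update(np for np, _ in np_scores.most_common(n))
```

After two rounds the seed "germany" has pulled in "france", but the ambiguous context "located in <X>" also drags in "the lake": hard incremental promotions can drift.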

Page 15: Bootstrapping Information Extraction with Unlabeled Data

Co-EM(Nigam & Ghani, 2000)

Iterative, Symmetric, Probabilistic

Similar to Co-Training, but probabilistically labels and adds all NPs and contexts to the labeled set:

P(class | context_i) = Σ_j P(class | NP_j) · P(NP_j | context_i)

P(class | NP_i) = Σ_j P(class | context_j) · P(context_j | NP_i)
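A single Co-EM labeling round can be sketched as follows. The corpus and seed are invented; because every pair here occurs exactly once, P(NP_j | context_i) reduces to a uniform average over co-occurring NPs, and a full implementation would clamp the seed labels and iterate to convergence.

```python
from collections import defaultdict

# Invented (NP, context) co-occurrences, each observed once.
pairs = [
    ("china", "located in <X>"),
    ("new york", "located in <X>"),
    ("new york", "traveled to <X>"),
    ("the dog", "<X> barked"),
]

# Initial P(location | NP): 1.0 for the seed, 0.0 elsewhere.
p_np = {"china": 1.0, "new york": 0.0, "the dog": 0.0}

by_ctx, by_np = defaultdict(list), defaultdict(list)
for np, ctx in pairs:
    by_ctx[ctx].append(np)
    by_np[np].append(ctx)

# P(class | context_i) = sum_j P(class | NP_j) P(NP_j | context_i)
p_ctx = {c: sum(p_np[w] for w in nps) / len(nps) for c, nps in by_ctx.items()}
# P(class | NP_i) = sum_j P(class | context_j) P(context_j | NP_i)
p_np = {w: sum(p_ctx[c] for c in ctxs) / len(ctxs) for w, ctxs in by_np.items()}
```

"new york" now carries a soft label of 0.25 rather than being either promoted or ignored; this graded labeling is what distinguishes Co-EM from the incremental algorithms.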

Page 16: Bootstrapping Information Extraction with Unlabeled Data

Meta-Bootstrapping(Riloff & Jones, 99)

Incremental, Asymmetric, Heuristic

Two-level process:
- NPs are used to score contexts according to co-occurrence frequency and diversity
- After the first level, all contexts are discarded and only the best NPs are retained
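The frequency-and-diversity scoring can be sketched with the RlogF metric from Riloff & Jones (1999); the extraction sets and lexicon below are invented:

```python
import math

def rlogf(extracted_nps: set, lexicon: set) -> float:
    """Score a context: F = distinct known category members it extracts,
    N = all distinct NPs it extracts; score = (F / N) * log2(F)."""
    f = len(extracted_nps & lexicon)
    if f == 0:
        return 0.0
    return (f / len(extracted_nps)) * math.log2(f)

lexicon = {"china", "france", "germany"}  # current location lexicon
# A context that extracts many known locations scores high...
s1 = rlogf({"china", "france", "germany", "the lake"}, lexicon)
# ...one that extracts a single known NP scores zero (log2(1) = 0).
s2 = rlogf({"the dog", "china"}, lexicon)
```

The log term rewards contexts that extract many *different* known members, not merely frequent ones.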

Page 17: Bootstrapping Information Extraction with Unlabeled Data

Common Assumptions

Seeds:
- Seed density in the corpus
- Head-labeling accuracy
- Syntactic-semantic agreement

Redundancy:
- Feature sets are redundant and sufficient
- Labeling disagreement

Page 18: Bootstrapping Information Extraction with Unlabeled Data

Feature Set Ambiguity

Feature sets: NPs and contexts

If the feature sets were redundantly sufficient, either of them alone would be enough to correctly classify an instance

Calculate the ambiguity for each feature set: Washington, went to <X>, visit <X>
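Measuring ambiguity per feature is straightforward once each NP (or context) is paired with its gold labels. The observations below are invented, with "washington" standing in for the genuinely ambiguous cases:

```python
from collections import defaultdict

# Invented (feature, gold class) observations from a labeled corpus.
observations = [
    ("washington", "location"), ("washington", "person"),
    ("china", "location"),
    ("the ceo", "person"),
]

classes = defaultdict(set)
for feature, cls in observations:
    classes[feature].add(cls)

# Ambiguity type = number of distinct classes a feature appears with.
ambiguity = {feature: len(cs) for feature, cs in classes.items()}
```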

Page 19: Bootstrapping Information Extraction with Unlabeled Data

NP Ambiguity (2% of NPs are ambiguous)

- 1 class: None 3574, Location 114, Organization 451, Person 189
- 2 classes: Location+None 6, Organization+None 3, Person+None 12, Loc+Org 56, Org+Person 13
- 3 classes: Loc+Org+None 1, Org+Person+None 3

Page 20: Bootstrapping Information Extraction with Unlabeled Data

Context Ambiguity (36% of contexts are ambiguous)

- 1 class: None, Location, Organization, Person
- 2 classes: Location+None, Organization+None, Person+None, Loc+Org, Org+Person
- 3 classes: Loc+Org+None, Org+Person+None
- 4 classes: Loc+Org+Person+None (6 contexts)

Page 21: Bootstrapping Information Extraction with Unlabeled Data

Labeling Disagreement

Agreement among human labelers

Same set of instances but different levels of information:
- NP only
- Context only
- NP and context
- NP, context, and the entire sentence from the corpus

Page 22: Bootstrapping Information Extraction with Unlabeled Data

Labeling Disagreement

90.5% agreement when NP, context, and sentence are given

88.5% when the sentence is not given

Page 23: Bootstrapping Information Extraction with Unlabeled Data

Results Comparing Bootstrapping Algorithms

Meta-Bootstrapping, Co-Training, Co-EM

Locations, Organizations, People

Page 24: Bootstrapping Information Extraction with Unlabeled Data

(Results chart comparing Co-EM, MetaBoot, and Co-Training)

Page 25: Bootstrapping Information Extraction with Unlabeled Data

(Results chart comparing Co-EM, MetaBoot, and Co-Training)

Page 26: Bootstrapping Information Extraction with Unlabeled Data

(Results chart comparing Co-EM, MetaBoot, and Co-Training)

Page 27: Bootstrapping Information Extraction with Unlabeled Data

More Results

Bootstrapping outperforms both baselines

Improvement is less pronounced for the "people" class: ambiguous classes don't benefit as much from bootstrapping?

Page 28: Bootstrapping Information Extraction with Unlabeled Data

Why does co-EM work well?

Co-EM outperforms Meta-Bootstrapping and Co-Training

Co-EM is probabilistic and does not make hard classifications, which reflects the ambiguity among classes

Page 29: Bootstrapping Information Extraction with Unlabeled Data

Summary

Starting with 10 seed words, extract NPs matching specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM

Probabilistic Bootstrapping with redundant feature sets is effective – even for ambiguous classes

Co-EM performs robustly even when the underlying assumptions are violated

Page 30: Bootstrapping Information Extraction with Unlabeled Data

Ongoing Work

- Varying initial seed size and type
- Collecting training corpus automatically (from the Web)
- Incorporating the user in the loop (Active Learning)