
Page 1: Bootstrapping Information Extraction with Unlabeled Data

Bootstrapping Information Extraction with Unlabeled Data

Rayid Ghani Accenture Technology Labs

Rosie JonesCarnegie Mellon University & Overture

(With contributions from Tom Mitchell and Ellen Riloff)

Page 2: Bootstrapping Information Extraction with Unlabeled Data

What is Information Extraction?

Analyze unrestricted text in order to extract pre-specified types of events, entities or relationships

Recent Commercial Applications:
- Database of job postings extracted from corporate web pages (flipdog.com)
- Extracting specific fields from resumes to populate HR databases (mohomine.com)
- Information integration (fetch.com)
- Shopping portals

Page 3: Bootstrapping Information Extraction with Unlabeled Data

IE Approaches

Hand-Constructed Rules

Supervised Learning
- Still costly to train and port to new domains
- 3-6 months to port to a new domain (Cardie 98)
- 20,000 words to learn named entity extraction (Seymore et al 99)
- 7,000 labeled examples to learn MUC extraction rules (Soderland 99)

Semi-Supervised Learning

Page 4: Bootstrapping Information Extraction with Unlabeled Data

Semi-Supervised Approaches

Several algorithms proposed for different tasks (semantic tagging, text categorization) and tested on different corpora: Expectation-Maximization, Co-Training, CoBoost, Meta-Bootstrapping, Co-EM, etc.

Goal: systematically analyze and test
- the assumptions underlying the algorithms
- the effectiveness of the algorithms on a common set of problems and corpus

Page 5: Bootstrapping Information Extraction with Unlabeled Data

Tasks

Extract noun phrases belonging to the following semantic classes:
- Locations
- Organizations
- People

Page 6: Bootstrapping Information Extraction with Unlabeled Data

Aren’t you missing the obvious?

Acquire lists of proper nouns:
- Locations: countries, states, cities
- Organizations: online databases
- People: names

Named Entity Extraction? But not all instances are proper nouns:
"by the river", "customer", "client"

Page 7: Bootstrapping Information Extraction with Unlabeled Data

Use context to disambiguate

A lot of NPs are unambiguous: "the corporation"

A lot of contexts are also unambiguous: subsidiary of <NP>

But as always, there are exceptions... and a LOT of them in this case: customer, John Hancock, Washington

Page 8: Bootstrapping Information Extraction with Unlabeled Data

Bootstrapping Approaches

Utilize redundancy in text:
- Noun phrases: New York, China, place we met last time
- Contexts: located in <X>, traveled to <X>

Learn two models:
- Use NPs to label contexts
- Use contexts to label NPs
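This two-view redundancy can be sketched in a few lines of Python. The mini-corpus, the seed set, and the single "location" class below are invented for illustration; a real system would score and rank candidates rather than accept them all.

```python
# Hypothetical (noun phrase, context) co-occurrence pairs.
pairs = [
    ("new york", "located in <X>"),
    ("new york", "traveled to <X>"),
    ("china", "located in <X>"),
    ("china", "traveled to <X>"),
    ("the dog", "<X> ran away"),
]

seeds = {"china"}  # NPs assumed known to be locations

# Use seed NPs to label the contexts they occur in...
location_contexts = {ctx for np, ctx in pairs if np in seeds}

# ...then use the newly labeled contexts to label more NPs.
location_nps = {np for np, ctx in pairs if ctx in location_contexts}
```

One round already pulls in "new york" via the contexts it shares with the seed, while "<X> ran away" and "the dog" stay unlabeled.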

Page 9: Bootstrapping Information Extraction with Unlabeled Data

Interesting Dimensions for Bootstrapping Algorithms

- Incremental vs. Iterative
- Symmetric vs. Asymmetric
- Probabilistic vs. Heuristic

Page 10: Bootstrapping Information Extraction with Unlabeled Data

Algorithms for Bootstrapping

Meta-Bootstrapping (Riloff & Jones, 1999): Incremental, Asymmetric, Heuristic

Co-Training (Blum & Mitchell, 1999): Incremental, Symmetric, Probabilistic(?)

Co-EM (Nigam & Ghani, 2000): Iterative, Symmetric, Probabilistic

Baselines:
- Seed-Labeling: label all NPs that match the seeds
- Head-Labeling: label all NPs whose head matches the seeds
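The two baselines can be made concrete with a small sketch. The seed list and NPs are invented, and the head of an NP is approximated here as its last token:

```python
seeds = {"canada", "company"}

def head(np: str) -> str:
    # Simplifying assumption: the head noun is the last token.
    return np.split()[-1]

nps = ["canada", "the acme company", "a canada goose", "users"]

# Seed-Labeling: exact match against the seed list.
seed_labeled = [np for np in nps if np in seeds]
# Head-Labeling: match the NP's head noun against the seed list.
head_labeled = [np for np in nps if head(np) in seeds]
```

Head-labeling generalizes further: "the acme company" matches via its head "company", which seed-labeling misses.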

Page 11: Bootstrapping Information Extraction with Unlabeled Data

Data Set

~4200 corporate web pages (WebKB project at CMU)

Test data marked up manually by labeling every NP as one or more of the following semantic categories: location, organization, person, none

Preprocessed (parsed) to generate NPs and extraction patterns using AutoSlog (Riloff, 1996)

Page 12: Bootstrapping Information Extraction with Unlabeled Data

Seeds

Location: australia, canada, china, england, france, germany, united states, switzerland, mexico, japan

People: customer, customers, subscriber, people, users, shareholders, individuals, clients, leader, director

Organizations: inc, praxair, company, companies, marine group, xerox, arco, timberlands, puretec, halter marine group, rayonier

Page 13: Bootstrapping Information Extraction with Unlabeled Data

Intuition Behind Bootstrapping

Noun Phrases: the dog, australia, france, the canary islands

Contexts: <X> ran away, travelled to <X>, <X> is beautiful

Page 14: Bootstrapping Information Extraction with Unlabeled Data

Co-Training(Blum & Mitchell, 99)

Incremental, Symmetric, Probabilistic

1. Initialize with positive and negative NP seeds
2. Use NPs to label all contexts
3. Add n top-scoring contexts for both the positive and negative class
4. Use new contexts to label all NPs
5. Add n top-scoring NPs for both the positive and negative class
6. Loop
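A minimal sketch of this loop, assuming an invented toy corpus, a single positive class, and raw co-occurrence counts as the scoring function (the original algorithm trains a classifier over each view):

```python
from collections import Counter

# Invented (noun phrase, context) co-occurrences.
pairs = [
    ("germany", "located in <X>"), ("germany", "flew to <X>"),
    ("france", "located in <X>"), ("the lake", "swam in <X>"),
    ("the lake", "located in <X>"), ("the dog", "<X> barked"),
]

labeled_nps = {"germany"}   # step 1: seed NPs
labeled_ctxs = set()
n = 1                       # items promoted per round

for _ in range(2):          # two bootstrapping rounds
    # Steps 2-3: score unlabeled contexts by co-occurrence with
    # labeled NPs, then promote the n best.
    ctx_scores = Counter(c for np, c in pairs
                         if np in labeled_nps and c not in labeled_ctxs)
    labeled_ctxs.update(c for c, _ in ctx_scores.most_common(n))
    # Steps 4-5: score unlabeled NPs by labeled contexts, promote n best.
    np_scores = Counter(np for np, c in pairs
                        if c in labeled_ctxs and np not in labeled_nps)
    labeled_nps.update(np for np, _ in np_scores.most_common(n))
```

After two rounds the seed "germany" has pulled in "france", but the ambiguous context "located in <X>" also drags in "the lake": hard incremental promotions can drift.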

Page 15: Bootstrapping Information Extraction with Unlabeled Data

Co-EM(Nigam & Ghani, 2000)

Iterative, Symmetric, Probabilistic

Similar to Co-Training, but probabilistically labels and adds all NPs and contexts to the labeled set:

P(class | context_i) = Σ_j P(class | NP_j) · P(NP_j | context_i)

P(class | NP_i) = Σ_j P(class | context_j) · P(context_j | NP_i)
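A single Co-EM labeling round can be sketched as follows. The corpus and seed are invented; because every pair here occurs exactly once, P(NP_j | context_i) reduces to a uniform average over co-occurring NPs, and a full implementation would clamp the seed labels and iterate to convergence.

```python
from collections import defaultdict

# Invented (NP, context) co-occurrences, each observed once.
pairs = [
    ("china", "located in <X>"),
    ("new york", "located in <X>"),
    ("new york", "traveled to <X>"),
    ("the dog", "<X> barked"),
]

# Initial P(location | NP): 1.0 for the seed, 0.0 elsewhere.
p_np = {"china": 1.0, "new york": 0.0, "the dog": 0.0}

by_ctx, by_np = defaultdict(list), defaultdict(list)
for np, ctx in pairs:
    by_ctx[ctx].append(np)
    by_np[np].append(ctx)

# P(class | context_i) = sum_j P(class | NP_j) P(NP_j | context_i)
p_ctx = {c: sum(p_np[w] for w in nps) / len(nps) for c, nps in by_ctx.items()}
# P(class | NP_i) = sum_j P(class | context_j) P(context_j | NP_i)
p_np = {w: sum(p_ctx[c] for c in ctxs) / len(ctxs) for w, ctxs in by_np.items()}
```

"new york" now carries a soft label of 0.25 rather than being either promoted or ignored; this graded labeling is what distinguishes Co-EM from the incremental algorithms.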

Page 16: Bootstrapping Information Extraction with Unlabeled Data

Meta-Bootstrapping(Riloff & Jones, 99)

Incremental, Asymmetric, Heuristic

Two-level process:
- NPs are used to score contexts according to co-occurrence frequency and diversity
- After the first level, all contexts are discarded and only the best NPs are retained
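The frequency-and-diversity scoring can be sketched with the RlogF metric from Riloff & Jones (1999); the extraction sets and lexicon below are invented:

```python
import math

def rlogf(extracted_nps: set, lexicon: set) -> float:
    """Score a context: F = distinct known category members it extracts,
    N = all distinct NPs it extracts; score = (F / N) * log2(F)."""
    f = len(extracted_nps & lexicon)
    if f == 0:
        return 0.0
    return (f / len(extracted_nps)) * math.log2(f)

lexicon = {"china", "france", "germany"}  # current location lexicon
# A context that extracts many known locations scores high...
s1 = rlogf({"china", "france", "germany", "the lake"}, lexicon)
# ...one that extracts a single known NP scores zero (log2(1) = 0).
s2 = rlogf({"the dog", "china"}, lexicon)
```

The log term rewards contexts that extract many *different* known members, not merely frequent ones.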

Page 17: Bootstrapping Information Extraction with Unlabeled Data

Common Assumptions

Seeds:
- Seed density in the corpus
- Head-labeling accuracy
- Syntactic-semantic agreement

Redundancy:
- Feature sets are redundant and sufficient
- Labeling disagreement

Page 18: Bootstrapping Information Extraction with Unlabeled Data

Feature Set Ambiguity

Feature sets: NPs and contexts

If the feature sets were redundantly sufficient, either of them alone would be enough to correctly classify an instance

Calculate the ambiguity for each feature set: Washington, went to <X>, visit <X>
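Measuring ambiguity per feature is straightforward once each NP (or context) is paired with its gold labels. The observations below are invented, with "washington" standing in for the genuinely ambiguous cases:

```python
from collections import defaultdict

# Invented (feature, gold class) observations from a labeled corpus.
observations = [
    ("washington", "location"), ("washington", "person"),
    ("china", "location"),
    ("the ceo", "person"),
]

classes = defaultdict(set)
for feature, cls in observations:
    classes[feature].add(cls)

# Ambiguity type = number of distinct classes a feature appears with.
ambiguity = {feature: len(cs) for feature, cs in classes.items()}
```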

Page 19: Bootstrapping Information Extraction with Unlabeled Data

NP Ambiguity (2% of NPs are ambiguous)

- 1 class: None 3574, Location 114, Organization 451, Person 189
- 2 classes: Location+None 6, Organization+None 3, Person+None 12, Loc+Org 56, Org+Person 13
- 3 classes: Loc+Org+None 1, Org+Person+None 3

Page 20: Bootstrapping Information Extraction with Unlabeled Data

Context Ambiguity (36% of contexts are ambiguous)

- 1 class: None, Location, Organization, Person
- 2 classes: Location+None, Organization+None, Person+None, Loc+Org, Org+Person
- 3 classes: Loc+Org+None, Org+Person+None
- 4 classes: Loc+Org+Person+None (6 contexts)

Page 21: Bootstrapping Information Extraction with Unlabeled Data

Labeling Disagreement

Agreement among human labelers

Same set of instances but different levels of information:
- NP only
- Context only
- NP and context
- NP, context, and the entire sentence from the corpus

Page 22: Bootstrapping Information Extraction with Unlabeled Data

Labeling Disagreement

90.5% agreement when NP, context, and sentence are given

88.5% when the sentence is not given

Page 23: Bootstrapping Information Extraction with Unlabeled Data

Results Comparing Bootstrapping Algorithms

Meta-Bootstrapping, Co-Training, Co-EM

Locations, Organizations, People

Page 24: Bootstrapping Information Extraction with Unlabeled Data

(Results chart comparing Co-EM, MetaBoot, and Co-Training)

Page 25: Bootstrapping Information Extraction with Unlabeled Data

(Results chart comparing Co-EM, MetaBoot, and Co-Training)

Page 26: Bootstrapping Information Extraction with Unlabeled Data

(Results chart comparing Co-EM, MetaBoot, and Co-Training)

Page 27: Bootstrapping Information Extraction with Unlabeled Data

More Results

Bootstrapping outperforms both baselines

Improvement is less pronounced for the "people" class: ambiguous classes don't benefit as much from bootstrapping?

Page 28: Bootstrapping Information Extraction with Unlabeled Data

Why does co-EM work well?

Co-EM outperforms Meta-Bootstrapping and Co-Training

Co-EM is probabilistic and does not make hard classifications, which reflects the ambiguity among classes

Page 29: Bootstrapping Information Extraction with Unlabeled Data

Summary

Starting with 10 seed words, extract NPs matching specific semantic classes using Meta-Bootstrapping, Co-Training, and Co-EM

Probabilistic Bootstrapping with redundant feature sets is effective – even for ambiguous classes

Co-EM performs robustly even when the underlying assumptions are violated

Page 30: Bootstrapping Information Extraction with Unlabeled Data

Ongoing Work

- Varying initial seed size and type
- Collecting training corpus automatically (from the Web)
- Incorporating the user in the loop (Active Learning)