NYU
ANLP-001
Automatic Discovery of Scenario-Level Patterns for Information Extraction
Roman Yangarber
Ralph Grishman
Pasi Tapanainen
Silja Huttunen
Outline
– Information Extraction: background
– Problems in IE
– Prior Work: Machine Learning for IE
– Discovering patterns from raw text
– Experimental results
– Current work
Quick Overview
What is Information Extraction? Definition:
– finding facts about a specified class of events in free text
– filling a table in a database (slots in a template)
Events: instances of relations, with many arguments
Example: Management Succession
– George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.
Example: Management Succession

Position    Company                               Location  Person          Status
President   European Information Services, Inc.   London    George Garrick  Out
CEO         Nielsen Marketing Research            USA       George Garrick  In

– George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.
System Architecture: Proteus

[Pipeline diagram] Input Text → Lexical Analysis → Name Recognition → Partial Syntax → Scenario Patterns (sentence level) → Reference Resolution → Discourse Analyzer (discourse level) → Output Generation → Extracted Information
Problems
– Customization
– Performance
Problems: Customization
To customize a system for a new extraction task, we have to develop:
– new patterns for new types of events
– word classes for the domain
– inference rules
This can be a large job requiring skilled labor; the expense of customization limits the uses of extraction.
Problems: Performance
Performance on event IE is limited. On MUC tasks, typical top performance is recall < 55%, precision < 75%.
Errors propagate through multiple phases:
– name recognition errors
– syntax analysis errors
– missing patterns
– reference resolution errors
– complex inference required
Missing Patterns
As with many language phenomena, there are:
– a few common patterns
– a large number of rare patterns
Rare patterns do not surface sufficiently often in a limited corpus.
Missing patterns make customization expensive and limit performance.
Finding good patterns is necessary to improve customization and performance.
[Figure: Zipf-like frequency vs. rank curve for patterns]
Prior Research
– build patterns from examples: Yangarber ’97
– generalize from multiple examples in annotated text: Crystal, Whisk (Soderland), Rapier (Califf)
– active learning to reduce annotation: Soderland ’99, Califf ’99
– learning from a corpus with relevance judgements: Riloff ’96, ’99
– co-learning/bootstrapping: Brin ’98, Agichtein ’00
Our Goals
Minimize the manual labor required to construct pattern bases for a new domain:
– un-annotated text
– un-classified text
– un-supervised learning
Use very large corpora -- larger than we could ever tag manually -- to improve coverage of patterns.
Principle: Pattern Density
If we have relevance judgements for the documents in a corpus, for the given task, then the patterns which are much more frequent in relevant documents will generally be good patterns.
Riloff (1996) finds patterns related to terrorist attacks.
Principle: Duality
Duality between patterns and documents:
– relevant documents are strong indicators of good patterns
– good patterns are strong indicators of relevant documents
Outline of Procedure
– Initial query: a small set of seed patterns which partially characterize the topic of interest
Repeat:
– Retrieve documents containing the current patterns: “relevant documents”
– Rank patterns in relevant documents by frequency in relevant docs vs. overall frequency
– Add the top-ranked pattern to the seed pattern set
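The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: documents are represented simply as sets of candidate patterns, the minimum-support cutoff of 2 is an assumed knob, and the score is a Riloff-96-style density-times-log-frequency metric of the kind the slides cite.

```python
import math

def discover_patterns(corpus, seeds, iterations=10):
    """corpus: list of sets; each set holds the candidate patterns of one document.
    Grows the pattern set from the seeds by iterating retrieve -> rank -> add."""
    patterns = set(seeds)
    for _ in range(iterations):
        # Documents matching any accepted pattern count as relevant.
        relevant = [doc for doc in corpus if patterns & doc]
        best, best_score = None, 0.0
        for p in {q for doc in corpus for q in doc} - patterns:
            in_rel = sum(p in doc for doc in relevant)
            overall = sum(p in doc for doc in corpus)
            if in_rel < 2:   # require some support before trusting a pattern
                continue
            # Density in relevant docs, weighted by log frequency there.
            score = (in_rel / overall) * math.log(in_rel)
            if score > best_score:
                best, best_score = p, score
        if best is None:
            break
        patterns.add(best)   # accept the top-ranked pattern
    return patterns

# Toy corpus: two retirement stories, one succession story, one irrelevant story.
corpus = [
    {"retires", "named president"},
    {"retires", "named president"},
    {"named president", "posts profit"},
    {"posts profit", "stock falls"},
]
```

Starting from the single seed "retires", the loop discovers "named president" because it co-occurs with the seed in the relevant documents, then stops when no remaining candidate has enough support.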
#1: pick seed pattern
Seed: < person retires >
#2: retrieve relevant documents
Seed: < person retires >
Relevant documents:
– Fred retired.... Harry was named president.
– Maki retired.... Yuki was named president.
Other documents: (no match for the seed)
#3: pick new pattern
Seed: < person retires >
< person was named president > appears in several relevant documents (top-ranked by the Riloff metric)
– Fred retired.... Harry was named president.
– Maki retired.... Yuki was named president.
#4: add new pattern to pattern set
Pattern set:
< person retires >
< person was named president >
Pre-processing
For each document, find and classify names: { person | location | organization | … }
Parse the document (regularizing passives, relative clauses, etc.)
For each clause, collect a candidate pattern: a tuple of the heads of
[ subject, verb, direct object, object/subject complement, locative and temporal modifiers, … ]
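The tuple-collection step can be sketched as follows. This assumes the parser has already regularized the clause and classified the names; the dict-based clause representation and field names are illustrative, not the system's actual data structures.

```python
def candidate_pattern(clause):
    """Reduce a parsed clause to a tuple of head words.
    Name heads are assumed to have been replaced by their semantic
    class (person, organization, ...) during name classification."""
    return (
        clause.get("subject"),
        clause.get("verb"),
        clause.get("object"),
        clause.get("complement"),
    )

# "Nielsen Marketing Research appointed George Garrick CEO" after
# name classification and parsing might yield:
clause = {
    "subject": "company",
    "verb": "appoint",
    "object": "person",
    "complement": "officer",
}
```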
Experiment
Task: Management succession (as in MUC-6)
Source: Wall Street Journal
Training corpus: ~6,000 articles
Test corpus:
– 100 documents: MUC-6 formal training
– + 150 documents judged manually
Experiment: two seed patterns

Subject   Verb       Object
company   v-appoint  person
person    v-resign   -

where v-appoint = { appoint, elect, promote, name } and v-resign = { resign, depart, quit, step-down }
Run the discovery procedure for 80 iterations.
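The two seeds can be written directly as data over the v-appoint and v-resign word classes. The matcher below is a toy stand-in for the system's pattern matching; "-" (no object) is modeled here as None.

```python
# Verb classes from the slide.
V_APPOINT = {"appoint", "elect", "promote", "name"}
V_RESIGN = {"resign", "depart", "quit", "step-down"}

# Seed patterns as (subject class, verb class, object class-or-None).
SEEDS = [
    ("company", V_APPOINT, "person"),
    ("person", V_RESIGN, None),
]

def matches(seed, tup):
    """True if a (subject, verb, object) clause tuple instantiates the seed."""
    subj, verbs, obj = seed
    return tup[0] == subj and tup[1] in verbs and (obj is None or tup[2] == obj)
```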
Evaluation
– Look at discovered patterns: new patterns missed in manual training
– Document filtering
– Slot filling
Discovered patterns

Subject   Verb                Object
company   v-appoint           person
person    v-resign            -
person    succeed             person
person    be | become         president | officer | chairman | executive
company   name                president | …
person    join | run | leave  company
person    serve               board | company
person    leave               post
Evaluation: new patterns
Not found in manual training:

Subject   Verb                      Object    Complements
company   bring                     person    [as+officer]
person    come | return             -         [to+company] [as+officer]
person    rejoin                    company   [as+officer]
person    continue | remain | stay  -         [as+officer]
person    replace                   person    [as+officer]
person    pursue                    interest  -
Evaluation: Text Filtering
How effective are the discovered patterns at selecting relevant documents?
– IR-style evaluation
– a document is selected if it matches at least one pattern

Pattern set        Recall  Precision
Seed               11%     93%
Seed + discovered  88%     81% (85)
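The IR-style measures in the table are the usual set-overlap definitions. A small sketch, with hypothetical document ids:

```python
def filtering_scores(retrieved, relevant):
    """Recall and precision for document filtering, as sets of document ids."""
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)

# Toy example: the filter retrieves 4 docs, 3 of which are truly relevant.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6}
recall, precision = filtering_scores(retrieved, relevant)
```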
[Figure: 250 test documents (relevance threshold 0.5) — recall and precision plotted against generation number, 0 to 80.]
[Figure: Choice of test corpus (threshold 0.5) — precision vs. recall curves for the 250-document corpus, the 100-document MUC corpus, and the MUC-6 players.]
Evaluation: Slot filling
How effective are the patterns within a complete IE system?
MUC-style IE on the MUC-6 corpora
Caveat

                training                   test
pattern set     recall  precision  F      recall  precision  F
seed            28      78         41
+ discovered    51      76         61
manual–MUC      54      71         62     47      70         56.40
manual–now      69      79         74     54.7    74.4       63.02
Conclusion: Automatic discovery
– Performance comparable to human (4-week development)
– Works from un-annotated text: allows us to take advantage of very large corpora (redundancy, duality)
– Will likely help wider use of IE
Good Patterns
U = universe of all documents
R = set of relevant documents
H = H(p) = set of documents where pattern p matched
Density criterion: a pattern p is good if its density among relevant documents, |H ∩ R| / |H|, is substantially greater than the base rate of relevance, |R| / |U|.
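The density criterion can be checked with plain set arithmetic. The factor alpha below is an assumed knob standing in for "substantially greater"; the paper may use a different formulation.

```python
def satisfies_density(H, R, U, alpha=2.0):
    """Density criterion sketch: pattern p (matching document set H) is
    promising if its density among relevant documents R exceeds the base
    rate of relevance in the universe U by a factor alpha (assumed knob)."""
    return len(H & R) / len(H) > alpha * len(R) / len(U)

U = set(range(100))      # 100 documents in total
R = set(range(10))       # 10% of them are relevant
H = {0, 1, 2, 3, 50}     # pattern matches 5 docs, 4 of them relevant
```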
Graded Relevance
Documents matching the seed patterns are considered 100% relevant. Discovered patterns are considered less certain; documents containing them are considered only partially relevant.
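One way to realize graded relevance is to carry a confidence in [0, 1] on each accepted pattern and let a document inherit the best confidence among the patterns it contains. The max-combination below is an illustrative assumption, not necessarily the paper's exact scheme; seeds get confidence 1.0.

```python
def doc_relevance(doc_patterns, confidences):
    """Partial relevance of a document: the highest confidence of any
    accepted pattern it contains (0.0 if it contains none)."""
    return max((confidences.get(p, 0.0) for p in doc_patterns), default=0.0)

# Hypothetical confidences: the seed is certain, a discovered pattern less so.
conf = {"retires": 1.0, "named president": 0.7}
```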
Scoring Patterns
Score(p) = (document frequency in relevant documents / overall document frequency) × log(document frequency in relevant documents)
– (metrics similar to those used in Riloff-96)