J. Turmo, 2006 Adaptive Information Extraction

Information Extraction

Jordi Turmo

TALP Research Centre

Dep. Llenguatges i Sistemes Informàtics

Universitat Politècnica de Catalunya

turmo@lsi.upc.edu

http://www.lsi.upc.edu/~turmo

Summary

• Information Extraction Systems
− Introduction
− Historical framework
− Architecture
− Knowledge specific for IE
− Examples

• Evaluation

• Multilinguality

• Adaptability

Introduction: Definition

• Goal: to locate the relevant information contained in a collection of documents and extract it in a specific format

• Input requirements: an extraction scenario and a document collection

• Output requirements: an output format

Introduction: Typology

• Different points of view:
− conceptual coverage: restricted-domain IE vs. open-domain IE
− language coverage: monolingual IE vs. multilingual IE
− media coverage: written-text IE, speech IE, image IE, multimedia IE
− document type: IE from free text, from semi-structured documents, and from structured documents (including Web pages in HTML and XML)

Introduction: Example 1, structured documents

• Web pages
• A list of members of an organization per document
• English
• Scenario of extraction: name, degree, school and affiliation of each member

Name        Degree  School   Affiliation
WL Hsu      PhD     Cornell  IIS, Sinica
CS Ho       PhD     NTU      EE, NTIT
C. Chen     PhD     SUNY     EE, NTIT
C. Wu       PhD     Utexas   Cedu, NNU
Mark Liao   PhD     NWU      IIS, Sinica
CJ Liau     PhD     NTU      IIS, Sinica
WK Cheng    PhD     TKU      Tunghai
WC Wang     MS      Syracus  FIT
...

Introduction: Example 2, semi-structured documents

• 485 seminar announcements
• A description of one seminar per document
• English
• Scenario of extraction: speaker, location, start time and end time of the seminar

Introduction: Example 3, free text

• 318 Wall Street Journal articles
• A description of an incident per document
• English
• Scenario of extraction: type of incident, perpetrator, target, date, location, effects and instrument

Input text:

A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650.

Output template:

Incident type: bombing
Date: March 19
Location: El Salvador: San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb

Introduction: Example 4, free text

• 78 documents
• A description of a mushroom per document
• Spanish
• Scenario of extraction: colors of the parts of mushrooms and the circumstances in which they occur

Input text:

El color blanco de su sombrero pasa a amarillo crema al corte. El sombrero ennegrece si se corta.
(The white color of its cap turns cream yellow when cut. The cap blackens if it is cut.)

Output frames:

Sombrero_1
  color:

Sombrero_2
  color:

virar_1
  inicio:
  final:
  causa: corte

virar_2
  inicio: indef
  final:
  causa: corte

color_1
  base: blanco
  tono: indef
  luz: indef

color_2
  base: amarillo
  tono: crema
  luz: indef

color_3
  base: indef
  tono: negro
  luz: indef

Introduction: Example 5, combination

• 78 documents
• A description of a mushroom per document
• Spanish
• Scenario of extraction: names of the mushroom in different languages, etymology, colors of the parts of mushrooms and the circumstances in which they occur

Introduction: Applications

• IE from the Web
• Building of news DBs
• Information integration
• Support for QA and summarization
• …

Limitation when precision (P) < 80%


Introduction: Recent events

• IJCAI 2001 Workshop on Adaptive Text Extraction and Mining (ATEM-2001)

• ECML 03/PKDD Workshop on Adaptive Text Extraction and Mining (ATEM-2003)

• AAAI 04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004)

• EACL 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)

• COLING-ACL 06 Workshop on Information Extraction Beyond the Document

• ECAI 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)

Historical framework: Origin of IE

• Acquisition of the relevant information involved in knowledge-based systems

• Traditionally (high human cost): experts on the domain → manual process → relevant information

• 1980s (text sources): text-based intelligent systems → relevant information

• Text-Based Intelligent Systems (TBIS)
− Information Retrieval
− Information Integration
− Information Filtering
− Information Routing
− Information Extraction
− Document Classification
− Question Answering
− Automatic Summarization
− Topic Detection & Tracking
...

Historical framework: Relevant historical programs

• Precedents: LSP (Sager, 81), FRUMP (DeJong, 82), JASPER (Hayes, 86)

• In the USA
− (1987-1991): MUC [US Navy]
− TIPSTER (1991-1998): MUC [DARPA]
− TIDES (1999-): ACE [NIST]

• In Europe
− LRE (1993-1996): TREE, AVENTINUS, FACILE, ECRAN, SPARKLE
− PASCAL excellence network (2003-)

Historical framework: MUC evolution

• MUC-1 (1987)
– naval operations
– self-defined scenarios
– self-evaluation

• MUC-2 (1989)
– naval operations
– output structure with 10 attributes (type of event, agent, place, ...)
– self-evaluation

• MUC-3 (1991)
– Latin-American terrorism
– output structure with 18 attributes (type of incident, date, place, ...)
– recall and precision measures

[Figure: diagram of the extracted and relevant sets, split into regions a to f; f is the partially extracted region]

extracted = a + b + e + f
relevant = a + f + d
recall = (a + 0.5 f) / (a + f + d)
precision = (a + 0.5 f) / (a + f + b + e)
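The two measures above can be sketched in a few lines of Python. This is only an illustration of the formulas on the slide: the function name and the sample counts are hypothetical, and the reading of the regions (a fully correct, f partially extracted, b and e extracted but not relevant, d relevant but missed) is an assumption taken from the formulas themselves.

```python
def muc3_scores(a: float, b: float, d: float, e: float, f: float) -> tuple[float, float]:
    """Recall and precision with half credit for partially extracted items.

    Assumed region meanings: a = correctly extracted, f = partially extracted,
    b and e = extracted but not relevant, d = relevant but not extracted.
    """
    credit = a + 0.5 * f          # partial extractions count for one half
    relevant = a + f + d
    extracted = a + f + b + e
    recall = credit / relevant if relevant else 0.0
    precision = credit / extracted if extracted else 0.0
    return recall, precision


# Hypothetical counts, for illustration only.
recall, precision = muc3_scores(a=6, b=3, d=2, e=1, f=2)
```

With these counts the credit is 7, so recall is 7/10 and precision is 7/12.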

• MUC-4 (1992)
– Latin-American terrorism
– 24 attributes
– F-score (weighted harmonic mean of precision p and recall r)

$$F = \frac{(\beta^2 + 1)\, p\, r}{\beta^2 p + r}$$

• MUC-5 (1993)
– financial news, microelectronics
– English, Japanese

• MUC-6 (1995)
– financial news
– subtasks: NE (named entities), coreference
– tasks: TE (template element), ST (scenario template)

• MUC-7 (1998)
– air crashes
– new task: TR (template relation)

• MUC-6, MUC-7
– Partial extractions are discarded

[Figure: diagram of the extracted and relevant sets, split into regions a to d]

extracted = a + b
relevant = a + d
recall = a / (a + d)
precision = a / (a + b)

$$F = \frac{(\beta^2 + 1)\, p\, r}{\beta^2 p + r}$$
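As a small worked example of the strict measures and the F-score, the sketch below computes them in Python. The function names and the counts are hypothetical; only the formulas come from the slides.

```python
def strict_scores(a: float, b: float, d: float) -> tuple[float, float]:
    """Recall and precision when partial extractions are discarded."""
    recall = a / (a + d) if (a + d) else 0.0
    precision = a / (a + b) if (a + b) else 0.0
    return recall, precision


def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F = (beta^2 + 1) * p * r / (beta^2 * p + r); beta = 1 weights p and r equally."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# Hypothetical counts: 6 correct, 2 spurious, 4 missed.
r, p = strict_scores(a=6, b=2, d=4)
f1 = f_score(p, r)
```

Here recall is 0.6, precision is 0.75 and the balanced F-score is 2/3.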

Architecture: General architecture

• Hobbs, 93:
– Cascade of transducers (or modules) that add structure to the text and often drop irrelevant information by applying rules
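A cascade of this kind can be pictured as function composition: each stage takes the output of the previous one, adds structure, and may drop material. The following is a minimal sketch under that reading; the stage names, the trigger list and the data layout are all hypothetical and are not taken from any system described here.

```python
from functools import reduce
from typing import Callable

Unit = dict
Stage = Callable[[list[Unit]], list[Unit]]

TRIGGERS = {"bomb", "attack"}  # hypothetical relevance triggers


def tokenize(units: list[Unit]) -> list[Unit]:
    # Adds structure: a token list for every unit.
    return [{**u, "tokens": u["text"].lower().split()} for u in units]


def drop_irrelevant(units: list[Unit]) -> list[Unit]:
    # Drops units that contain no trigger word.
    return [u for u in units if TRIGGERS & set(u["tokens"])]


def mark_triggers(units: list[Unit]) -> list[Unit]:
    # Adds structure: the trigger words found in each remaining unit.
    return [{**u, "triggers": sorted(TRIGGERS & set(u["tokens"]))} for u in units]


def cascade(*stages: Stage) -> Stage:
    # Chains the stages so that each one consumes the previous stage's output.
    return lambda units: reduce(lambda acc, stage: stage(acc), stages, units)


pipeline = cascade(tokenize, drop_irrelevant, mark_triggers)
result = pipeline([
    {"text": "A bomb went off this morning"},
    {"text": "The weather was mild"},
])
```

Only the first unit survives the cascade, and it now carries its tokens and trigger words.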

Architecture: Traditional architecture

[Figure: pipeline diagram, refined over three slides]

• Document preprocessing: text control → lexical analysis → syntactic analysis
• Pattern matching (resources: conceptual hierarchy, pattern base)
• Postprocess: discourse analysis → output template generation (resource: output format)

Architecture: Text control

• Filtering relevant documents
• Guessing the language of the documents
• Splitting documents into textual zones
• Filtering relevant zones
• Splitting text into appropriate units (e.g. sentences)
• Filtering relevant units
• Tokenizing units

• Example

<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .>

<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga . ><Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>
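The last three text-control steps (splitting into units, filtering units, tokenizing) can be sketched as follows. This is a deliberately naive illustration: the regular expressions and the relevance test are assumptions chosen for the example, not the method of any particular system.

```python
import re
from typing import Callable


def split_units(text: str) -> list[str]:
    # Naive sentence splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(unit: str) -> list[str]:
    # Runs of word characters (accented letters included) or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", unit)


def text_control(text: str, is_relevant: Callable[[str], bool]) -> list[list[str]]:
    # Split, keep the relevant units, then tokenize them.
    return [tokenize(u) for u in split_units(text) if is_relevant(u)]


text = ("Con la edad generalmente palidece sus tonos. "
        "Puede confundirse con otras foliotas comestibles, pero alguna especie es amarga.")

# Hypothetical relevance filter: keep units that mention tones ("tonos").
units = text_control(text, lambda u: "tonos" in u)
```

The result holds one tokenized unit, with the final period as a separate token.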

Architecture: Lexical analysis

• Identifying morpho-syntactic and semantic categories of words (resource: general lexicon)
• Recognizing terminology words (resource: specific dictionaries)
• Recognizing time expressions, quantities, abbreviations, …
• Expanding abbreviations (resource: lists of abbreviations + expansions)
• Recognizing and classifying proper nouns (Named Entity Recognition and Classification, NERC) (resources: gazetteers, patterns)
• Dealing with unknown words
• Dealing with lexical ambiguities (resources: POS taggers, WSD (???))
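Two of the lexical steps above, abbreviation expansion from a list and proper-noun recognition from a gazetteer, can be sketched as simple lookups. The entries, labels and function names below are hypothetical, and real systems combine such lookups with patterns and taggers.

```python
# Hypothetical resources; real gazetteers and abbreviation lists are much larger.
GAZETTEER = {("san", "salvador"): "location"}
ABBREVIATIONS = {"cm": "centimeter", "dr.": "doctor"}


def expand_abbreviations(tokens: list[str]) -> list[str]:
    # Replace each known abbreviation by its expansion; leave other tokens as they are.
    return [ABBREVIATIONS.get(t.lower(), t) for t in tokens]


def gazetteer_lookup(tokens: list[str]) -> list[tuple[int, int, str]]:
    """Greedy longest-match lookup; returns (start, end, label) spans."""
    max_len = max(len(key) for key in GAZETTEER)
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                spans.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            # No gazetteer entry starts at this token.
            i += 1
    return spans


tokens = "a power tower in San Salvador".split()
spans = gazetteer_lookup(tokens)
expanded = expand_abbreviations(["8", "cm"])
```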

• Example 1

<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .> <Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .> <Con la edad generalmente palidece sus tonos .>

<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga . ><Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas.>

Highlighted categories (they depend on the scenario): time expressions, mushroom names, abbreviations, numbers, morphological parts

• Example 2

<A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy , but no casualties have been reported .>
<According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650 .>

Highlighted categories: time expressions, locations, organizations, persons

Architecture: Syntactic analysis

• Full parsing (Lolita, LaSIE, LaSIE-II)
– inefficient, because of the size of the grammars
– lacks robustness (out-of-vocabulary words)
– treebank grammars
– cascaded grammars
– solves some problems related to tuning and incompleteness

• Partial parsing
− the most commonly used
− chunks or phrasal trees (noun phrases, verb phrases, prepositional phrases, adjective phrases, adverb phrases)
− absence of global dependencies

Architecture: Semantic interpretation

• Compositional semantics
− full parsing + λ-expressions
  − LaSIE, LaSIE-II
  − entries with λ-expressions in the lexicons
− partial parsing + grammatical relations [Vilain, 99]
− output = logical forms

• Compositional semantics (example 1)

A bomb went off this morning near a power tower in San Salvador …

[Figure: parse tree over the sentence, with nodes np, np, pp, np, pp, vp and s]

go_off → λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) (bombing(x,y,z,r,s,t))
power_tower → λ(x) (power_tower(x))

λ(z) λ(y) λ(x) (bombing(x,y,z,bomb,today_morning,power_tower(San_Salvador)))
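The λ-expressions above behave like curried functions, so the composition can be imitated with nested Python lambdas. This is only a sketch: tuples stand in for logical forms, and the order in which the arguments get bound (place, then time, then instrument) is an assumption about how the parse tree is traversed.

```python
# Lexicon entries, curried in the same order as on the slide:
# go_off -> λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) bombing(x, y, z, r, s, t)
go_off = lambda t: lambda s: lambda r: lambda z: lambda y: lambda x: (
    "bombing", x, y, z, r, s, t)
power_tower = lambda x: ("power_tower", x)

# Bind the target, the time and the instrument found in the sentence.
partial = go_off(power_tower("San_Salvador"))("today_morning")("bomb")

# 'partial' still expects z, y and x, like the last formula on the slide.
# Placeholder values show where the remaining arguments end up.
full = partial("Z")("Y")("X")
```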

• Compositional semantics (example 2)

A bomb went off this morning near a power tower in San Salvador …

[Figure: grammatical relations drawn over the sentence: subj, time, place, location_of]

event(bombing, E)
subj(bomb, E)
time(today_morning, E)
place(power_tower, E)
location_of(power_tower, San_Salvador)

• Pattern matching
− applied after partial parsing + SVO dependencies
− the most widespread approach
− patterns can be implemented in different ways
− scenario-driven approach (TE, TR, ST, …)
− output = partial templates

• Pattern matching (example)

A bomb went off this morning near a power tower in San Salvador …

np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location)
→
INSTRUMENT := C-instrument
DATE := C-time
PHIS_TARGET := C-place
LOCATION := C-location
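One way to read the rule above is as a sequence of chunk tests, where "…" allows intervening material and adjacent elements must follow each other directly. The sketch below implements that reading with a greedy matcher and no backtracking. The chunk encoding, the hand-built input and the function name are hypothetical; only the pattern and the slot names come from the slide.

```python
from typing import Optional

# Each chunk: (chunk type, head or literal word, semantic class or None). Hand-built input.
CHUNKS = [
    ("np", "bomb", "instrument"),
    ("vp", "go_off", None),
    ("np", "today_morning", "time"),
    ("word", "near", None),
    ("np", "power_tower", "place"),
    ("word", "in", None),
    ("np", "San_Salvador", "location"),
]

# Pattern element: (chunk type, required head, required class, slot to fill, gap allowed before).
PATTERN = [
    ("np", None, "instrument", "INSTRUMENT", True),
    ("vp", "go_off", None, None, True),
    ("np", None, "time", "DATE", True),
    ("word", "near", None, None, True),
    ("np", None, "place", "PHIS_TARGET", False),
    ("word", "in", None, None, False),
    ("np", None, "location", "LOCATION", False),
]


def match(pattern, chunks) -> Optional[dict]:
    """Greedy left-to-right match; returns the partial template or None."""
    template, pos = {}, 0
    for ctype, head, sem, slot, gap_before in pattern:
        while pos < len(chunks):
            c_type, c_head, c_sem = chunks[pos]
            if c_type == ctype and head in (None, c_head) and sem in (None, c_sem):
                break
            if not gap_before:
                return None      # the element had to be adjacent
            pos += 1             # skip material covered by "…"
        else:
            return None          # ran out of chunks
        if slot:
            template[slot] = chunks[pos][1]
        pos += 1
    return template


template = match(PATTERN, CHUNKS)
```

The greedy strategy keeps the sketch short; a pattern with ambiguous gaps would need backtracking.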

Architecture: Discourse analysis

• Inter-sentence analysis
− Co-reference resolution
− Ellipsis resolution
− Alias resolution
− Traditional semantic interpretation procedures
− Template merging procedures

• Inference procedures
− Open-domain and domain-specific knowledge for inferences

• Example

A bomb went off this morning near a power tower in San Salvador …, but no casualties have been reported

According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650

λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))

λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))

• Example (continued)

λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))
λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))

Unification & inference:

λ(y) (bombing(urban_guerrilla_comandos,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))

Inference (blew_up → destroyed):

bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))
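The merging step can be sketched as unification over partial templates followed by a lexical inference rule. Everything below is an illustration under assumptions: the slot names replace the positional arguments of bombing(…), and the hand-written compatibility table stands in for the real knowledge that 0650 is part of this morning and that the northwestern part of San Salvador lies in San Salvador.

```python
from typing import Optional

# Hypothetical compatibility knowledge: (slot, value kept, value absorbed).
COMPATIBLE = {
    ("time", "today_morning", "0650"),
    ("target", "power_tower(San_Salvador)",
     "power_tower(the_northwestern_part_of_San_Salvador)"),
}
# Hypothetical inference rule table: verb -> (slot, inferred value).
INFERENCE = {"blew_up": ("effect_on_target", "destroyed")}


def unify(t1: dict, t2: dict) -> Optional[dict]:
    """Merge two partial templates; fail if a shared slot holds incompatible values."""
    merged = dict(t1)
    for slot, value in t2.items():
        if slot not in merged:
            merged[slot] = value
        elif merged[slot] != value and (slot, merged[slot], value) not in COMPATIBLE:
            return None
    return merged


def infer(template: dict, verbs: list[str]) -> dict:
    """Fill still-empty slots from the verbs seen in the text."""
    result = dict(template)
    for verb in verbs:
        if verb in INFERENCE:
            slot, value = INFERENCE[verb]
            result.setdefault(slot, value)
    return result


first = {"effect_on_humans": "no_casualties", "instrument": "bomb",
         "time": "today_morning", "target": "power_tower(San_Salvador)"}
second = {"perpetrator": "urban_guerrilla_comandos", "instrument": "bomb",
          "time": "0650",
          "target": "power_tower(the_northwestern_part_of_San_Salvador)"}

merged = infer(unify(first, second), ["blew_up"])
```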

Architecture: Output template generation

• Mapping of the extracted pieces onto the desired output format

• Specific inferences:
− Normalization to predefined slot values
− Mandatory slots
− Extracted information that implies different slot values

• Example

bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))

today_morning → March_19
no_casualties = no_injuries_or_death

Incident type: bombing
Date: March 19
Location: El Salvador: San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb
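The normalization and mandatory-slot steps can be sketched as a table lookup plus default filling. The mapping table, the list of mandatory slots and the function name are hypothetical; in particular, resolving "today_morning" to a date would normally use the document date rather than a fixed table.

```python
# Hypothetical normalization table and mandatory slots.
NORMALIZE = {
    "today_morning": "March 19",
    "no_casualties": "no injury or death",
}
MANDATORY = ("Incident type", "Date", "Location")


def generate_template(extracted: dict) -> dict:
    """Map extracted values onto predefined slot values and fill mandatory slots."""
    template = {slot: NORMALIZE.get(value, value) for slot, value in extracted.items()}
    for slot in MANDATORY:
        template.setdefault(slot, "-")   # "-" marks an empty slot, as in the output format
    return template


output = generate_template({
    "Incident type": "bombing",
    "Date": "today_morning",
    "Effect on human target": "no_casualties",
})
```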

Knowledge specific for IE: Characteristics of IE systems

• Strong dependence on the domain
− Scenario of extraction
− Semantics vs. syntax
− Discourse analysis

• Strong dependence on the text structure
− Sublanguages
− Meta-information

• Strong dependence on the output format
− DBs
− annotations

• Importance of portability and tuning

• Importance of knowledge engineering
− Modularity
− Basic tasks and specific tasks
− Use of weak and local knowledge

• Importance of NL resources
− MRDs (machine-readable dictionaries), ontologies, general lexicons, specific dictionaries, …

Knowledge specific for IE: Knowledge resources

• Knowledge that is more or less stable
− general lexicon
− general grammar
− basic NL processors: segmenters, taggers, parsers, …

• Domain-dependent knowledge
− domain-specific vocabularies, terminology
− gazetteers and patterns for NERC
− IE patterns

• Knowledge specifically used for IE: IE patterns

Knowledge specific for IE: Types of IE patterns

• Viewpoint 1: type of representation

− rules

np(C-instrument) … vp(go_off) … np(C-time) … “near” np(C-place) “in” np(C-location)
→
Event:INSTRUMENT := C-instrument
Event:DATE := C-time
Event:PHIS_TARGET := C-place
Event:LOCATION := C-location

− statistical models (BNs, HMMs, ME, hyperplanes, …)

[Figure: HMM over the words of seminar announcements. The states emit word groups such as "who, speaker, 5409, appointment"; "with, about, how, …"; "seminar, reminder, theater, …"; "that, by, speaker, …"; "dr., professor, robert, michael, mr"; "w, cavalier, stevens, christel"; "will, (, received, Has, …". Transition probabilities shown: 1.0, 1.0, 0.99, 0.76, 0.24, 0.99, 0.56]

• Viewpoint 2: type of values extracted

− slot filler extraction patterns (the HMM presented before)

− relation extraction patterns

np(C-person) … vp(is) pron(C-his) “wife”
→
Married_with:HUSBAND := C-his
Married_with:WIFE := C-person

− event extraction patterns (the rule presented before)

• Viewpoint 3: number of slot fillers extracted
− single-slot IE patterns (the HMM presented before)
− multi-slot IE patterns (both rules presented before)

Examples of IE systems: Methodologies [Turmo, 2002]

System     Reference
LaSIE      Gaizauskas et al, 1995
LaSIE-II   Humphreys et al, 1998
LOLITA     Garigliano et al, 1998
CIRCUS     Lehnert et al, 1991
FASTUS     Hobbs et al, 1993
BADGER     Fisher et al, 1995
HASTEN     Krupka, 1995
PROTEUS    Grishman, 1995
ALEMBIC    Aberdeen et al, 1993
PIE        Lin, 1995
TURBIO     Turmo, 2002
PLUM       Weischedel et al, 1995
IE2        Aone et al, 1998
LOUELLA    Childs et al, 1995
SIFT       Miller et al, 1998

[Table columns Parsing, Semantics and Discourse, with cells spanning several systems. Cell values: in-depth understanding; chunking; partial parsing; syntactico-semantic parsing; pattern matching; grammatical relations interpretation; semantic interpretation procedures; template merging; -]

Examples of IE systems: Knowledge [Turmo, 2002]

Systems: LaSIE, LaSIE-II, LOLITA, CIRCUS, FASTUS, BADGER, HASTEN, PROTEUS, ALEMBIC, TURBIO, PIE, PLUM, IE2, LOUELLA, SIFT

[Table columns Parsing, Semantics and Discourse, with cells spanning several systems. Cell values: treebank grammar; λ-expressions; hand-crafted stratified general grammar; general grammar; semantic network; concept nodes (AutoSlog); hand-crafted IE rules; concept nodes (CRYSTAL); decision trees; phrasal grammar; E-graphs; IE rules (ExDISCO); hand-crafted grammatical relations; IE rules (EVIUS); hand-crafted rules; statistical models for syntactic-semantic parsing and coreference resolution, learned from the PTB and on-domain annotated texts]

Examples of IE systems: LaSIE-II system

[Figure: pipeline diagram]

• Modules: gazetteer lookup → sentence splitter → Brill tagger → tagged morph → Buchart parser → name matcher → discourse interpreter → template writer
• Resources: lexicon, gazetteers, stratified grammar, conceptual hierarchy
• Outputs: TE, TR, ST

Preprocessing
• NERC preprocessing via gazetteers and keyword lists
• Root form and inflectional suffix for the verbs, nouns and adjectives found in the sentences

According_to-adv unofficial-adj source[s]-n , the-det bomb-n – allegedly-adv detonate[ed]-v by-prep urban-adj guerrilla-n commando[s]-n - blow_up-v a-det power_tower-n in-prep the-det northwestern-adj part-n of-prep San Salvador-loc at-prep 0650

Syntactico-semantic interpretation
• bottom-up chart parser
• cascade of NERC grammars (e.g. aircraft, person, money, time, timex)

According_to-adv unofficial-adj source[s]-n , the-det bomb-n – allegedly-adv detonate[ed]-v by-prep urban-adj guerrilla-n commando[s]-n - blow_up-v a-det power_tower-n in-prep the-det northwestern part of San Salvador-loc at-prep 0650-time

(NE1: northwestern part of San Salvador-loc; NE2: 0650-time)

• cascade of partial grammars (NPs, PPs, complex NPs, VPs, complex VPs, RelClauses, Sentence)

S(According_to-adv NP(unofficial-adj source[s]-n) , NP(the-det bomb-n) – allegedly-adv VP(detonate[ed]-v) PP(by-prep NP(urban-adj guerrilla-n commando[s]-n)) - VP(blow_up-v) NP(a-det power_tower-n) PP(in-prep NP(the-det NE1-loc)) PP(at-prep NP(NE2-time)))

• QLFs (note: the real implementation of QLFs is not specified)

Event(E1), detonate(E1,Y,X), urban_guerrilla_comando(X), bomb(Y), Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)

Discourse analysis
• Name matcher: matches variants of NEs across the text
• Discourse interpreter:
− adds the QLF representation to a semantic net (links)
− adds presuppositions
− coreference resolution

Event(E1), detonate(E1,Y,X), urban_guerrilla_comando(X), bomb(Y), Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)

[Figure: fragment of the semantic net, with nodes "bombing event", "destroy" and "location of event" linked by isa and implies arcs]

Output template generation
• procedure that writes the templates in the desired format

Incident type: bombing
Date: March 19
Location: El Salvador: San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb

Examples of IE systems: PROTEUS system

[Figure: pipeline diagram]

• Modules: lexical analyzer → NERC → partial parsing → scenario patterns → coreference resolution → discourse analysis → output generator
• Resources: lexicon, NERC rules, chunk grammar, IE rules, conceptual hierarchy, inference rules, format rules
• Outputs: TE, TR, ST

Preprocessing

According_to-adv unofficial-adj sources-n , the-det bomb-n – allegedly-adv detonated-v by-prep urban-adj guerrilla-n commandos-n - blew_up-v a-det power_tower-n in-prep the-det northwestern part of San Salvador-loc at-prep 0650-time

(NE1: the location expression ending in San Salvador-loc; NE2: 0650-time)

J. Turmo, 2006 Adaptive Information Extraction

Examples of IE systems

PROTEUS systemPROTEUS system

Sintactico-semantic interpretation• basic VP and NP chunks+head_semantics• semantics refer to types of slot fillers (Conceptual hierarchy)

According_to-adv NP(unofficial-adj sources-n-s1) , NP(the-det bomb-n-artifact) – allegedly-adv VP(detonated-v-s3) by-prep NP(urban-adj guerrilla-n commandos-n-person) – VP(blew_up-v-s4) NP(a-det power_tower-n-building) in-prep NP(NE1-location) at-prep NP(NE2-time)

NERC Partial parsing

ScenarioPatterns

Coreferenceresolution

DiscourseAnalysis

Output generator

Lexicon NERC Rules

Lexical Analizer

TE TR ST

Chunk grammar IE-Rules Format

RulesConceptual hierarchy

Inference Rules

J. Turmo, 2006 Adaptive Information Extraction

Examples of IE systems

PROTEUS systemPROTEUS system

Sintactico-semantic interpretation• basic VP and NP chunks+head_semantics• IE-rules for relations (appositions, PP-attachments, limited conjunctions)

• NP(A-person) , B-integer years old , → instance(X,person), name_of(X,A), age_of(X,B)• NP(A-position) of NP(B-company) → instance(X,person), position_of(X,A), company_of(X,B)


Real implementation as objects:

Slot    Value
Class   person
name    A
age     B
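The apposition rule for persons and ages shown earlier, together with the object representation, can be sketched as follows. This is only an illustration: the regular expression, the `Person` class, and the function name are assumptions, not the actual PROTEUS rule formalism.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    """Object form of instance(X, person) with its slots."""
    name: Optional[str] = None
    age: Optional[int] = None


# Hypothetical surface form of the rule:
#   NP(A-person) , B-integer years old ,
AGE_RULE = re.compile(r"NP\((?P<A>[^()]+)-person\) , (?P<B>\d+) years old ,")


def apply_age_rule(chunked: str) -> List[Person]:
    """Build one Person object per match of the apposition rule."""
    return [
        Person(name=m.group("A"), age=int(m.group("B")))
        for m in AGE_RULE.finditer(chunked)
    ]


people = apply_age_rule("NP(John Smith-person) , 42 years old , resigned-v")
print(people)
```

Each match fills the `name` and `age` slots of a new `person` instance, mirroring name_of(X,A) and age_of(X,B).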


Syntactico-semantic interpretation
• basic VP and NP chunks + head semantics
• IE-rules for relations (appositions, PP-attachments, limited conjunctions)
• IE-rules for events (PET interface or ExDISCO)
− NP(A-artifact) v-s4 NP(B-building) → instance(E1,s4), instrument_of(E1,A), physical_target_of(E1,B)

According_to-adv NP(unofficial-adj sources-n-s1) , NP(the-det bomb-n-artifact) – allegedly-adv VP(detonated-v-s3) by-prep NP(urban-adj guerrilla-n commandos-n-person) – VP(blew_up-v-s4) NP(a-det power_tower-n-building) in-prep NP(NE1-location) at-prep NP(NE2-time)


Discourse analysis
• antecedents are sought in sequential order
• constraints:
− instance of a hyperclass
− same number
− shared arguments
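A minimal sketch of an antecedent search under the three constraints above. Several points are assumptions made for illustration: the hierarchy fragment, the `Mention` structure, trying the nearest preceding mention first, and reading "shared arguments" as "arguments present on both sides must agree".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical fragment of a conceptual hierarchy: class -> parent class.
PARENT = {"bomb": "artifact", "artifact": "entity", "person": "entity"}


def is_a(cls: Optional[str], ancestor: str) -> bool:
    """True if cls equals ancestor or descends from it in PARENT."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


@dataclass
class Mention:
    text: str
    cls: str
    number: str  # "sg" or "pl"
    args: Dict[str, str] = field(default_factory=dict)


def compatible(anaphor: Mention, candidate: Mention) -> bool:
    # Constraint 1: one class is a hyperclass of (or equal to) the other.
    related = is_a(candidate.cls, anaphor.cls) or is_a(anaphor.cls, candidate.cls)
    # Constraint 2: same grammatical number.
    same_number = anaphor.number == candidate.number
    # Constraint 3 (assumed reading): arguments known on both sides agree.
    shared = set(anaphor.args) & set(candidate.args)
    args_agree = all(anaphor.args[k] == candidate.args[k] for k in shared)
    return related and same_number and args_agree


def find_antecedent(anaphor: Mention, previous: List[Mention]) -> Optional[Mention]:
    # Assumption: candidates are visited from the nearest one backwards.
    for candidate in reversed(previous):
        if compatible(anaphor, candidate):
            return candidate
    return None


previous = [
    Mention("urban guerrilla commandos", "person", "pl"),
    Mention("the bomb", "bomb", "sg"),
    Mention("a power tower", "building", "sg"),
]
antecedent = find_antecedent(Mention("the device", "artifact", "sg"), previous)
print(antecedent.text if antecedent else None)
```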


Discourse analysis
• QLFs (quasi-logical forms) + inference rules = more complex QLFs
− conversion of date expressions
− inference of slot values from the QLFs already obtained
− inference of events from others explicitly described

"Fred, the president of Cuban Cigar Corp., was appointed vice president of Microsoft" implies "Fred left the Cuban Cigar Corp."
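The inference in the Fred example can be sketched as a rule over simple facts. The tuple encoding and the predicate names `position_of`, `appointed`, and `left` are assumptions for illustration; they are not the QLF notation used by PROTEUS.

```python
from typing import Set, Tuple

Fact = Tuple[str, ...]


def infer_departures(facts: Set[Fact]) -> Set[Fact]:
    """If a person holds a position at company C1 and is appointed at a
    different company C2, infer that the person left C1."""
    inferred: Set[Fact] = set()
    for f in facts:
        if f[0] != "position_of":
            continue
        _, person, _, company = f
        for g in facts:
            if g[0] == "appointed" and g[1] == person and g[3] != company:
                inferred.add(("left", person, company))
    return inferred


facts = {
    ("position_of", "Fred", "president", "Cuban Cigar Corp."),
    ("appointed", "Fred", "vice president", "Microsoft"),
}
new_facts = infer_departures(facts)
print(new_facts)
```

The inferred event is not stated in the text; it only becomes available once the explicit facts have been extracted.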


Output template generation
• use of rules to build the templates in the desired format


IE2 system

[Figure: IE2 architecture. Modules: NetOwl Extractor 3.0, CustomNameTag, PhraseTag, EventTag, Discourse Module, TempGen. Knowledge sources: hand-crafted rules, decision tree. Outputs: TE, TR, ST.]


Preprocessing
• only NERC
• SGML-tagged
• general NE types and subtypes
• restricted-domain NE types and subtypes

<person id=1>Jeff Bantle</person>, <entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight


Syntactico-semantic interpretation
• SGML-tagging of phrases that are values of slots
• NPs denoting persons (PNP), organizations (ENP), artifacts (ANP), …
• local links (location-of, employee-of, owner-of, …)

<person id=1>Jeff Bantle</person>, <PNP affil=2><entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight</PNP>


Syntactico-semantic interpretation
• SGML-tagging of phrases that are values of slots in templates
• NPs
• local semantic relations (employee-of, location-of, product-of, …)
• event IE-rules (note: the real implementation is not specified)
− $Vehicle + LaunchN → launch_event::vehicle_info := $Vehicle

<launch_event id=2 vehicle_info=1><ANP> The <vehicle id=1>Ariane 5</vehicle> launch </ANP> was successfully achieved at 6am
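Since the real implementation of the event rules is not specified, the following is only a sketch of how a rule such as "$Vehicle + LaunchN" could be matched over NE-tagged text. The regular expression, the output dictionary, and the function name are assumptions, not part of IE2.

```python
import re
from typing import Dict, List

# Hypothetical reading of "$Vehicle + LaunchN": a tagged vehicle
# immediately followed by the noun "launch".
VEHICLE_LAUNCH = re.compile(
    r"<vehicle id=(?P<id>\d+)>(?P<name>[^<]+)</vehicle>\s+launch\b"
)


def tag_launch_events(sgml: str) -> List[Dict[str, object]]:
    """Create one launch_event per match, pointing at the vehicle's id."""
    events: List[Dict[str, object]] = []
    for n, m in enumerate(VEHICLE_LAUNCH.finditer(sgml), start=1):
        events.append({
            "type": "launch_event",
            "event_no": n,
            "vehicle_info": int(m.group("id")),
            "vehicle_name": m.group("name"),
        })
    return events


text = ("<ANP> The <vehicle id=1>Ariane 5</vehicle> launch </ANP> "
        "was successfully achieved at 6am")
events = tag_launch_events(text)
print(events)
```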


Discourse analysis
• Three coreference resolution methods:
− Rule-based
− Machine-learning-based
− Hybrid
• Name alias resolution, in addition to that performed by NetOwl
• Definite NPs
• Singular personal pronouns

<person id=1>Jeff Bantle</person>, <PNP ref=1 affil=2><entity id=2>NASA</entity>’s mission operations directorate representative for the shuttle flight</PNP>


Output template generation
• Translates SGML output into templates in the desired format
• Resolves and normalizes time expressions
• Performs event merging


SIFT system

[Figure: SIFT architecture. Modules: IdentiFinder™, Sentence level, Cross-sentence level, Output generator. Knowledge sources: statistical models. Outputs: TE, TR.]


Preprocessing
• NERC using an HMM [Bikel et al. 97] + Viterbi decoding, maximizing Pr(W,F,C)
• each word is tagged with one NE class

[Figure: HMM states: start-sentence, person, organization, location, not-a-name, end-sentence.]
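Viterbi decoding over the name-class states can be sketched as follows. This is a simplified first-order HMM over words only: it ignores the word features F, and the lexicon and all probabilities are toy values invented for illustration, not the trained model of [Bikel et al. 97].

```python
import math
from typing import Callable, Dict, List

STATES = ["person", "organization", "location", "not-a-name"]

# Toy lexicon and scores, invented for illustration only.
LEXICON = {
    "person": {"Jeff", "Bantle"},
    "organization": {"NASA"},
    "location": {"Houston"},
}
KNOWN_NAMES = set().union(*LEXICON.values())


def emit_p(state: str, word: str) -> float:
    """Toy emission score of a word given a name class."""
    if state == "not-a-name":
        return 0.01 if word in KNOWN_NAMES else 0.5
    return 0.5 if word in LEXICON[state] else 0.01


START_P = {s: 0.25 for s in STATES}  # start-sentence -> state
END_P = {s: 0.25 for s in STATES}    # state -> end-sentence
TRANS_P = {p: {s: (0.4 if p == s else 0.2) for s in STATES} for p in STATES}


def viterbi(words: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit: Callable[[str, str], float],
            end_p: Dict[str, float]) -> List[str]:
    # best[i][s] = (log score of the best path ending in s at word i, backpointer)
    best = [{s: (math.log(start_p[s]) + math.log(emit(s, words[0])), None)
             for s in STATES}]
    for w in words[1:]:
        col = {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in STATES),
                key=lambda x: x[1],
            )
            col[s] = (score + math.log(emit(s, w)), prev)
        best.append(col)
    # Choose the final state, including the transition to end-sentence.
    last = max(STATES, key=lambda s: best[-1][s][0] + math.log(end_p[s]))
    path = [last]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    return list(reversed(path))


tags = viterbi(["Jeff", "Bantle", "joined", "NASA"],
               START_P, TRANS_P, emit_p, END_P)
print(tags)
```

Each word receives exactly one name class, and consecutive words with the same class form one named entity.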


Syntactico-semantic interpretation
• properties of NEs (TE) and relations (TR)
• generative statistical model [Miller et al. 98, 00]
• search for the most likely augmented parse tree (bottom-up, chart-based)
• pruning of low-probability constituents


Syntactico-semantic interpretation

Nance , a paid consultant to ABC News , …

[Figure: augmented parse tree for the sentence above.]
Tagged words: per/nnp , det vbn per-desc/nn to org’/nnp org/nnp ,
Constituent labels: per-r/np, per-desc/np, org-r/np, org-ptr/pp, emp-of/pp-lnk, per-desc-r/np, per/np


Syntactico-semantic interpretation
• relations between NEs across sentences
• statistical model [Miller et al. 98]
• classifier of pairs of entities such that:
− the entities are in different sentences
− the entities do not take part in local relations
− their types are compatible with some relation
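The three conditions on entity pairs can be read as a filter that selects the candidates passed to the classifier. The sketch below assumes a hypothetical table of relation signatures and reads the second condition literally (neither entity appears in any local relation); the classifier itself is not shown.

```python
from itertools import combinations
from typing import Dict, List, NamedTuple, Set, Tuple


class Entity(NamedTuple):
    id: int
    type: str
    sentence: int


# Hypothetical relation signatures: relation -> pair of argument types.
RELATION_TYPES: Dict[str, Tuple[str, str]] = {
    "employee_of": ("person", "organization"),
    "location_of": ("organization", "location"),
}


def candidate_pairs(entities: List[Entity],
                    local_pairs: Set[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return entity id pairs that satisfy the three conditions."""
    in_local = {e for pair in local_pairs for e in pair}
    pairs = []
    for a, b in combinations(entities, 2):
        if a.sentence == b.sentence:
            continue  # must be in different sentences
        if a.id in in_local or b.id in in_local:
            continue  # must not take part in local relations
        if any({a.type, b.type} == set(sig) for sig in RELATION_TYPES.values()):
            pairs.append((a.id, b.id))  # types fit at least one relation
    return pairs


entities = [
    Entity(1, "person", 0),
    Entity(2, "organization", 0),
    Entity(3, "organization", 1),
    Entity(4, "location", 2),
    Entity(5, "person", 2),
]
pairs = candidate_pairs(entities, local_pairs={(1, 2)})
print(pairs)
```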


TURBIO system

[Figure: TURBIO architecture. Modules: Lexical Analyzer, NERC, Partial parsing, controller, IE-Rule set processor, Output generator. Knowledge sources: Lexicon, NERC Rules, Partial-tree grammar, IE-rule set scheduling, IE-Rule sets. Outputs: TE, TR.]


Preprocessing
• WordNet synsets, lemmas, POS tags
• NERC
• parse trees of noun, verbal, and adjectival phrases


Syntactico-semantic interpretation
• Hypothesis: dependence among relations of NEs
• Iterative execution of IE-rule sets depending on the scheduling
• Example:
− Scenario = mushroom parts, their possible colors, and the circumstances under which they are produced
− Some colors in the documents are not related to any mushroom part, but all colors related to a circumstance are colors related to mushroom parts.
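One hypothetical way to exploit this dependence is to schedule the color-circumstance rule set first and let the part-color rule set use its output as evidence. The fact encoding, the two rule sets, and `run_schedule` below are assumptions made for illustration; they are not TURBIO's actual rule language or controller.

```python
from typing import Callable, Iterable, List, Set, Tuple

Fact = Tuple[str, ...]
RuleSet = Callable[[Set[Fact]], Set[Fact]]


def color_circumstance_rules(facts: Set[Fact]) -> Set[Fact]:
    """Relate a color to a circumstance found in the same sentence."""
    colors = {f[1] for f in facts if f[0] == "color"}
    circs = {f[1] for f in facts if f[0] == "circumstance"}
    return {("color_when", f[1], f[2]) for f in facts
            if f[0] == "same_sentence" and f[1] in colors and f[2] in circs}


def part_color_rules(facts: Set[Fact]) -> Set[Fact]:
    """Relate a part to a color, trusting only colors already tied to a
    circumstance (the dependence between the two relations)."""
    parts = {f[1] for f in facts if f[0] == "part"}
    trusted = {f[1] for f in facts if f[0] == "color_when"}
    return {("color_of", f[2], f[1]) for f in facts
            if f[0] == "same_sentence" and f[1] in trusted and f[2] in parts}


def run_schedule(schedule: List[RuleSet], facts: Iterable[Fact]) -> Set[Fact]:
    """Apply each rule set in order; later sets see earlier results."""
    known = set(facts)
    for rule_set in schedule:
        known |= rule_set(known)
    return known


document_facts = {
    ("color", "yellow"), ("color", "blue"),
    ("part", "cap"), ("circumstance", "bruising"),
    ("same_sentence", "yellow", "bruising"),
    ("same_sentence", "yellow", "cap"),
    ("same_sentence", "blue", "cap"),
}
result = run_schedule([color_circumstance_rules, part_color_rules], document_facts)
print(sorted(f for f in result if f[0] in ("color_when", "color_of")))
```

With the two rule sets in the opposite order, no part-color relation is produced in a single pass, which is why the scheduling matters.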
