Information Extraction October 13, 2006


Information Extraction

October 13, 2006


What is Information Extraction?

Input:
  Specification:
    Types of entities to find
    Types of relations to find
    Templates to fill
  Corpus of text:
    Possibly formatted
    Possibly annotated for linguistic structure

Output:
  Text + annotation:
    Entities tagged with type and coreference info
    Relations between entities tagged
  Filled templates:
    Instances of templates found in text


MUC: Genesis of IE

DARPA funded significant efforts in IE in the early to mid 1990s.

The Message Understanding Conference (MUC) was an annual event/competition where results were presented.

It focused on extracting information from news articles:
  Terrorist events
  Industrial joint ventures
  Company management changes

Information extraction is of particular interest to the intelligence community (CIA, NSA). (Note: early ’90s)


MUC

Named entity: Person, Organization, Location

Co-reference: “Clinton”, “President Bill Clinton”

Template element: Perpetrator, Target

Template relation: Incident

Multilingual


Named entities and events

San Salvador, 19 Apr 89 (ACAN-EFE) -- [TEXT] Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerrillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … Vice President-elect Francisco Merino said that when the attorney general's car stopped at a light on a street in downtown San Salvador, an individual placed a bomb on the roof of the armored vehicle. …


Coreference links

[Same news passage as on the previous slide, with coreference links marked between mentions of the same entity, e.g. “Attorney General Roberto Garcia Alvarado”, “Garcia Alvarado”, and “the attorney general”.]


(Partial) Scenario template

Incident: Date                          19 Apr 89
Incident: Location                      El Salvador: San Salvador (CITY)
Incident: Type                          Bombing
Perpetrator: Individual ID              “urban guerrillas”
Perpetrator: Organization ID            “FMLN”
Perpetrator: Organization Confidence    Suspected or Accused by Authorities: “FMLN”
Physical Target: Description            “vehicle”
Physical Target: Effect                 Some Damage: “vehicle”
Human Target: Name                      “Roberto Garcia Alvarado”
Human Target: Description               “attorney general”: “Roberto Garcia Alvarado”
Human Target: Effect                    Death: “Roberto Garcia Alvarado”


MUC Typical Text

Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.


MUC Templates

Relationship: tie-up

Entities: Bridgestone Sports Co, a local concern, a Japanese trading house

Joint venture company: Bridgestone Sports Taiwan Co

Activity: ACTIVITY 1

Amount: NT$20,000,000


MUC Templates

ACTIVITY 1

Activity: Production

Company: Bridgestone Sports Taiwan Co

Product: Iron and “metal wood” clubs

Start Date: January 1990


Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Activity: ACTIVITY-1
  Amount: NT$20000000

Example from FASTUS (1993)


ACTIVITY-1
  Activity: PRODUCTION
  Company: “Bridgestone Sports Taiwan Co.”
  Product: “iron and ‘metal wood’ clubs”
  Start Date: DURING: January 1990
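The two linked templates can be pictured as records in which one slot of TIE-UP-1 points at ACTIVITY-1. The sketch below is only an illustration: the class and field names are assumptions derived from the slot labels, and the amount is written to match the "20 million new Taiwan dollars" in the passage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Activity:
    """ACTIVITY template (field names are assumed from the slot labels)."""
    activity: str
    company: str
    product: str
    start_date: str


@dataclass
class TieUp:
    """TIE-UP template; the `activity` slot links to a filled Activity."""
    relationship: str
    entities: List[str]
    joint_venture_company: str
    activity: Activity
    amount: str


activity_1 = Activity(
    activity="PRODUCTION",
    company="Bridgestone Sports Taiwan Co.",
    product="iron and 'metal wood' clubs",
    start_date="DURING: January 1990",
)

tie_up_1 = TieUp(
    relationship="TIE-UP",
    entities=["Bridgestone Sports Co.", "a local concern", "a Japanese trading house"],
    joint_venture_company="Bridgestone Sports Taiwan Co.",
    activity=activity_1,
    amount="NT$20000000",  # 20 million new Taiwan dollars, per the passage
)

# The cross-template link lets a consumer walk from the tie-up to its activity.
print(tie_up_1.activity.product)
```

Keeping the activity as a separate record mirrors the slide, where ACTIVITY-1 is its own template referenced by name.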


Automated Content Extraction

Objectives:
  Extract information from texts of varying quality
  Detect unique entities, events, and relations:
    Find all entity mentions
    Link mentions by entity
  Track entities within and across documents
  Output XML for downstream processes


ACE entity and mention types

Entity Type Subtypes

Person (PER) N/A

Organization (ORG) Government, Commercial, Educational, Non-profit, Other

Location (LOC) Address, Boundary, Celestial, Land-Region-Natural, Region-Local, Region-Subnational, Region-National, Region-International, Water-Body, Other

Geo-Political Entity (GPE) Continent, Nation, State-or-Province, County-or-District, Population-Center, Other

Facility (FAC) Building, Subarea-Building, Bounded-Area, Conduit, Path, Barrier, Plant, Other

Vehicle (VEH) Land, Air, Water, Subarea-Vehicle, Other

Weapon (WEA) Blunt, Exploding, Sharp, Chemical, Biological, Shooting, Projectile, Nuclear, Other

Entity Mention Type Description

Name (NAM) A proper name reference to the entity

Nominal (NOM) A common noun reference to the entity

Pronominal (PRO) A pronoun reference to the entity

Premodifier (PRE) A premodifier reference to the entity


ACE relation and event types

Relation type Subtypes

Physical (PHYS) Located, Near, Part-Whole

Personal/Social (PER-SOC) Business, Family, Other

Employment/Membership/Subsidiary (EMP-ORG) Employ-Executive, Employ-Staff, Employ-Undetermined, Member-of-Group, Partner, Subsidiary, Other

Agent-Artifact (ART) User-or-Owner, Inventor-or-Manufacturer, Other

PER/ORG Affiliation (OTHER-AFF) Ethnic, Ideology, Other

GPE Affiliation Citizen-or-Resident, Based-in, Other

Discourse (DISC) N/A

Event Types

Destruction/ Damage (BRK)

Creation/ Improvement (MAK)

Transfer of Possession or Control (GIV)

Movement (MOV)

Interaction of Agents (INT)

Event roles

Agent

Object

Source (MOV/GIV)

Target (MOV/GIV)

Time

Location

Other


Applications

Information gathering (intelligence tasks)
Question answering
  Answer extraction from retrieved documents
Ontology induction
Improving indexing for IR


IE task breakdown

Entities:
  Identification: finding entity mentions
  Classification: determining entity type
  Normalization: standardizing entity mentions (e.g., identifying co-referring entity mentions)

Relations:
  Association: identifying related entities and their relations


Two approaches to IE

Knowledge-engineering approach
  Grammar rules built by hand
  Human expert generates domain-specific patterns through introspection and corpus work
  Iterative process: build, test, evaluate errors, repeat

Data-driven approach
  Use statistical methods
  Learn recognizers and classifiers from annotated data where available
  Leverage unannotated corpora, if possible, by bootstrapping


Knowledge engineering

Advantages:
  Conceptually straightforward
  Best-performing systems still hand-built

Disadvantages:
  Lots of human effort required
  Human expertise also required
  Not readily portable to new domains or languages


Data-driven approach

Advantages:
  Porting to new domains straightforward
  Domain expertise not necessary
  Good coverage is ensured

Disadvantages:
  Training data may not exist or may be difficult to acquire
  Changes in specification may require re-annotation of training data


Which approach to use?

Use hand-built rule-based approach when:
  Resources (esp. lexicons) available
  Rule writers available
  Training data unavailable or hard to get
  Extraction specifications subject to change
  Highest possible performance needed

Use data-driven approach when:
  Resources unavailable
  Rule writers unavailable
  Training data cheap and plentiful
  Extraction specifications stable
  Good performance good enough


Typical NLP tasks for IE

Tokenization
  Finding word boundaries
Lexical lookup
  Using domain lexicons w/type information, e.g., first-name lists, place-name lists, etc.
Part-of-speech tagging
  POS tags provide generalization for later processes
  Can be hand-built or machine-learned
Shallow parsing
Coreference resolution


Shallow parsing: cascaded finite-state transducers

Limited linguistic analysis:
  Grammar divided into levels (chunks and clauses)
  Pipeline of finite-state recognizers/transducers
Robust:
  Local decisions, no global optimization
Easy-first parsing:
  High-precision decisions
  Attachment decisions can be indefinitely delayed
Time and space efficient:
  Deterministic search


Natural Language Processing-based Information Extraction

If extracting from automatically generated web pages, simple regex patterns usually work.

If extracting from more natural, unstructured, human-written text, some NLP may help.
  Part-of-speech (POS) tagging
    Mark each word as a noun, verb, preposition, etc.
  Syntactic parsing
    Identify phrases: NP, VP, PP
  Semantic word categories (e.g. from WordNet)
    KILL: kill, murder, assassinate, strangle, suffocate

Extraction patterns can use POS or phrase tags.
  Crime victim:
    Prefiller: [POS: V, Hypernym: KILL]
    Filler: [Phrase: NP]
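A minimal sketch of how the "crime victim" pattern might be applied. Everything beyond the pattern itself is an assumption for illustration: tokens arrive already tagged by hand (no real tagger or parser is called), the `KILL` set stands in for a WordNet hypernym lookup, and `lemma` is a crude demo helper.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    text: str
    pos: str               # coarse POS tag such as "V" or "N" (hand-assigned here)
    phrase: Optional[str]  # id of the enclosing NP, e.g. "NP2", or None


# Stand-in for the semantic class KILL; a real system would consult WordNet.
KILL = {"kill", "murder", "assassinate", "strangle", "suffocate"}


def lemma(word: str) -> str:
    """Crude past-tense stripping, good enough for this demo only."""
    w = word.lower()
    for suffix in ("ed", "d"):
        if w.endswith(suffix) and w[: -len(suffix)] in KILL:
            return w[: -len(suffix)]
    return w


def crime_victims(tokens: List[Token]) -> List[str]:
    """Prefiller [POS: V, Hypernym: KILL] followed by Filler [Phrase: NP]."""
    victims = []
    for i, tok in enumerate(tokens):
        if tok.pos == "V" and lemma(tok.text) in KILL:
            # The filler is the noun phrase that starts right after the verb.
            if i + 1 < len(tokens) and tokens[i + 1].phrase is not None:
                np_id = tokens[i + 1].phrase
                words = []
                for t in tokens[i + 1:]:
                    if t.phrase != np_id:
                        break
                    words.append(t.text)
                victims.append(" ".join(words))
    return victims


sentence = [
    Token("Guerrillas", "N", "NP1"),
    Token("assassinated", "V", None),
    Token("the", "Art", "NP2"),
    Token("attorney", "N", "NP2"),
    Token("general", "N", "NP2"),
    Token(".", "PUNCT", None),
]
print(crime_victims(sentence))
```

The pattern never mentions a specific verb; generalizing through the POS tag and the semantic class is what lets one rule cover "killed", "murdered", "assassinated", and so on.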


FASTUS

Based on finite state automata (FSA) transductions

1. Complex Words: Recognition of multi-words and proper names
     e.g. “set up”, “new Taiwan dollars”

2. Basic Phrases: Simple noun groups, verb groups and particles
     e.g. “a Japanese trading house”, “had set up”

3. Complex Phrases: Complex noun groups and verb groups
     e.g. “production of 20,000 iron and metal wood clubs”

4. Domain Events: Patterns for events of interest to the application. Basic templates are to be built.
     e.g. [company] [set up] [Joint-Venture] with [company]

5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event.


Finite Automaton for Noun Groups: “John’s interesting book with a nice cover”

[Figure: finite automaton with five states (0 to 4) and transitions labeled PN, ’s, Art, ADJ, N, P]

Grep++ = Cascaded grepping
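In the spirit of "cascaded grepping", a noun-group recognizer can be sketched as a regular expression over POS tags rather than over characters. The pattern below is an assumption chosen to accept the example phrase; it is not a transcription of the automaton in the figure.

```python
import re

# One possible noun-group pattern over POS tags (an assumed grammar):
#   (Art | PN 's)?  ADJ*  N  ( P  Art?  ADJ*  N )*
NOUN_GROUP = re.compile(
    r"^(?:Art |PN 's )?(?:ADJ )*N(?: P(?: Art)?(?: ADJ)* N)*$"
)


def is_noun_group(tags):
    """Return True if the space-joined tag sequence matches the pattern."""
    return NOUN_GROUP.match(" ".join(tags)) is not None


# "John 's interesting book with a nice cover"
tags = ["PN", "'s", "ADJ", "N", "P", "Art", "ADJ", "N"]
print(is_noun_group(tags))
```

Because the input is a tag sequence, an earlier stage (tokenization and tagging) has to run first, which is exactly the cascade idea: each level greps over the output of the level below.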


Rule-based Extraction Examples

Determining which person holds what office in what organization
  [person] , [office] of [org]
    Vuk Draskovic, leader of the Serbian Renewal Movement
  [org] (named, appointed, etc.) [person] P [office]
    NATO appointed Wesley Clark as Commander in Chief

Determining where an organization is located
  [org] in [loc]
    NATO headquarters in Brussels
  [org] [loc] (division, branch, headquarters, etc.)
    KFOR Kosovo headquarters
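The four rules above can be sketched as regular expressions. This is a toy under stated assumptions: the small alternation lists stand in for an upstream entity tagger, the relation names `holds-office` and `located-in` are made up for the example, and "headquarters" is treated as optional in the third rule.

```python
import re

# Tiny hypothetical gazetteers standing in for a real entity tagger.
PERSON = r"(?P<person>Vuk Draskovic|Wesley Clark)"
ORG = r"(?P<org>the Serbian Renewal Movement|NATO|KFOR)"
OFFICE = r"(?P<office>leader|Commander in Chief)"
LOC = r"(?P<loc>Brussels|Kosovo)"

RULES = [
    # [person] , [office] of [org]
    ("holds-office", re.compile(rf"{PERSON}, {OFFICE} of {ORG}")),
    # [org] (named, appointed, etc.) [person] P [office]
    ("holds-office",
     re.compile(rf"{ORG} (?:named|appointed) {PERSON} (?:as|to) {OFFICE}")),
    # [org] in [loc]
    ("located-in", re.compile(rf"{ORG}(?: headquarters)? in {LOC}")),
    # [org] [loc] (division, branch, headquarters, etc.)
    ("located-in",
     re.compile(rf"{ORG} {LOC} (?:division|branch|headquarters)")),
]


def extract(text):
    """Return (relation, slots) for every rule match found in the text."""
    results = []
    for relation, pattern in RULES:
        for match in pattern.finditer(text):
            results.append((relation, match.groupdict()))
    return results


for example in [
    "Vuk Draskovic, leader of the Serbian Renewal Movement",
    "NATO appointed Wesley Clark as Commander in Chief",
    "NATO headquarters in Brussels",
    "KFOR Kosovo headquarters",
]:
    print(extract(example))
```

In a real system the bracketed slots would match typed spans produced by earlier stages, not fixed word lists.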


IE with Hidden Markov Models


Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

[Figure: the same model drawn two ways. Finite state model: states connected by transitions, emitting observations o1 o2 o3 o4 o5 o6 o7 o8. Graphical model: chain … S_{t-1}, S_t, S_{t+1} … with each state S_t emitting an observation O_t.]

Generates:
  State sequence
  Observation sequence

\[ P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t) \]

Parameters: for all states S = {s1, s2, …}
  Start state probabilities: P(s_1)
  Transition probabilities: P(s_t | s_{t-1})
  Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: Maximize probability of training observations (w/ prior)


Markov Property

[Figure: three-state transition diagram. S1: rain, S2: cloud, S3: sun; edges labeled 1/2, 1/2, 1/3, 2/3, 1]

The state of a system at time t+1, q_{t+1}, is conditionally independent of {q_{t-1}, q_{t-2}, …, q_1, q_0} given q_t.

In other words, the current state determines the probability distribution for the next state.


Markov Property

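[Figure: three-state transition diagram. S1: rain, S2: cloud, S3: sun; edges labeled 1/2, 1/2, 1/3, 2/3, 1]

State-transition probabilities:

\[ A = \begin{pmatrix} 0 & 0 & 1 \\ 0.5 & 0.5 & 0 \\ 0.67 & 0.33 & 0 \end{pmatrix} \]

Q: Given today is sunny (i.e., q_1 = 3), what is the probability of “sun-cloud” with the model?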
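The question can be worked through in a few lines. This sketch rests on two assumptions that the transcript does not settle: that rows of A index the current state and columns the next state, both ordered rain, cloud, sun (with 0.67 and 0.33 standing for 2/3 and 1/3), and that "sun-cloud" means sun today followed by cloud tomorrow.

```python
STATES = ["rain", "cloud", "sun"]  # S1, S2, S3

# Assumed reading of the matrix: A[i][j] = P(next = j | current = i).
A = [
    [0.0, 0.0, 1.0],      # rain  -> sun
    [0.5, 0.5, 0.0],      # cloud -> rain or cloud
    [2 / 3, 1 / 3, 0.0],  # sun   -> rain or cloud
]


def path_probability(path):
    """P(q2, ..., qn | q1) for a first-order Markov chain."""
    idx = [STATES.index(s) for s in path]
    prob = 1.0
    for cur, nxt in zip(idx, idx[1:]):
        prob *= A[cur][nxt]  # Markov property: only the current state matters
    return prob


print(path_probability(["sun", "cloud"]))
```

Under those assumptions the answer is the single entry A[sun][cloud]; longer sequences just multiply one entry per step.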


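Hidden Markov Model

S1: rain, S2: cloud, S3: sun

[Figure: the three-state chain (transition probabilities 1/2, 1/2, 1/3, 2/3, 1) with emission probabilities 4/5, 1/10, 7/10, 1/5, 3/10, 9/10 attached to the states. Labels: state sequences; observations O1 O2 O3 O4 O5.]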


IE with Hidden Markov Model

Given a sequence of observations:

  SI/EECS 767 is held weekly at SIN2 .

and a trained HMM (states: course name, location name, background)

Find the most likely state sequence (Viterbi):

\[ \hat{s} = \arg\max_{s} P(s, o) \]

Any words said to be generated by the designated “course name” state are extracted as a course name:

  Course name: SI/EECS 767
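The decoding step can be sketched with a small Viterbi implementation. All probabilities below are invented toy values, not a trained model, and the `floor` probability for unseen words is a demo shortcut rather than proper smoothing.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p, floor=1e-6):
    """Most likely state sequence argmax_s P(s, o), computed in log space."""
    def emit(state, word):
        # Words a state never emitted get a small floor probability.
        return math.log(emit_p[state].get(word, floor))

    # best[t][s] = (log-prob of the best path ending in s at time t, backpointer)
    best = [{s: (math.log(start_p[s]) + emit(s, obs[0]), None) for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + emit(s, obs[t]), prev)
        best.append(column)

    # Follow backpointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return path[::-1]


# Toy parameters (assumed values for illustration only).
states = ["background", "course", "location"]
start_p = {"background": 0.4, "course": 0.5, "location": 0.1}
trans_p = {
    "background": {"background": 0.6, "course": 0.1, "location": 0.3},
    "course": {"background": 0.4, "course": 0.5, "location": 0.1},
    "location": {"background": 0.5, "course": 0.1, "location": 0.4},
}
emit_p = {
    "background": {"is": 0.3, "held": 0.2, "weekly": 0.2, "at": 0.3},
    "course": {"SI/EECS": 0.5, "767": 0.5},
    "location": {"SIN2": 1.0},
}

obs = "SI/EECS 767 is held weekly at SIN2".split()
path = viterbi(obs, states, start_p, trans_p, emit_p)
course_name = " ".join(w for w, s in zip(obs, path) if s == "course")
print(path)
print("Course name:", course_name)
```

Extraction is then just a read-out: every word aligned with the designated "course" state becomes part of the course-name field.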


Named Entity Extraction [Bikel et al., 1998]

Hidden states:
  start-of-sentence
  Person
  Org
  (five other name classes)
  Other
  end-of-sentence


Named Entity Extraction

Transition probabilities: P(s_t | s_{t-1}, o_{t-1})

Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1})

(1) Generating first word of a name-class

(2) Generating the rest of the words in the name-class

(3) Generating “+end+” in a name-class


HMM-Experimental Results

Train on ~500k words of news wire text.

Results:


Learning HMM for IE [Seymore, 1999]

Consider labeled, unlabeled, and distantly-labeled data


Some Issues with HMM

Need to enumerate all possible observation sequences
Not practical to represent multiple interacting features or long-range dependencies of the observations
Very strict independence assumptions on the observations


We Want More than an Atomic View of Words

Would like richer representation of text: many arbitrary, overlapping features of the words.
  identity of word
  ends in “-ski”
  is capitalized
  is part of a noun phrase
  is in a list of city names
  is under node X in WordNet
  is in bold font
  is indented
  is in hyperlink anchor
  last person name was female
  next two words are “and Associates”
  …

[Figure: chain S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; one observation is annotated: is “Wisniewski”, ends in “-ski”, part of noun phrase]


Maximum Entropy Markov Models

[Figure: chain S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; one observation is annotated: is “Wisniewski”, ends in “-ski”, part of noun phrase]

Observation features: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …

Idea: replace generative model in HMM with a maxent model, where state depends on observations:

\[ \Pr(s_t \mid x_t) = \ldots \]

Courtesy of William W. Cohen

[Lafferty, 2001]


Problems with Richer Representation and a Generative Model

These arbitrary features are not independent:
  Multiple levels of granularity (chars, words, phrases)
  Multiple dependent modalities (words, formatting, layout)
  Past & future

Two choices:

  Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!

  Ignore the dependencies. This causes “over-counting” of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!

[Figure: two chain diagrams over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}, one per choice]


MEMM

[Figure: chain S_{t-1}, S_t, S_{t+1} over observations O_{t-1}, O_t, O_{t+1}; one observation is annotated: is “Wisniewski”, ends in “-ski”, part of noun phrase]

Observation features: identity of word; ends in “-ski”; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor; …

Idea: replace generative model in HMM with a maxent model, where state depends on observations and previous state history:

\[ \Pr(s_t \mid x_t, s_{t-1}, s_{t-2}, \ldots) = \ldots \]

Courtesy of William W. Cohen


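HMM vs. MEMM

HMM:
\[ \Pr(s, o) = \prod_i \Pr(s_i \mid s_{i-1})\, \Pr(o_i \mid s_i) \]

MEMM:
\[ \Pr(s \mid o) = \prod_i \Pr(s_i \mid s_{i-1}, o_i) \]

[Figure: one chain diagram per model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}]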


Conditional Sequence Models

We prefer a model that is trained to maximize a conditional probability rather than a joint probability, P(s|o) instead of P(s,o):

Can examine features, but not responsible for generating them.

Don’t have to explicitly model their dependencies.

Don’t “waste modeling effort” trying to generate what we are given at test time anyway.


Conditional Markov Models (CMMs) vs HMMs

HMM:
\[ \Pr(s, o) = \prod_i \Pr(s_i \mid s_{i-1})\, \Pr(o_i \mid s_i) \]

CMM:
\[ \Pr(s \mid o) = \prod_i \Pr(s_i \mid s_{i-1}, o_i) \]

Lots of ML ways to estimate Pr(y | x)


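From HMMs to CRFs

Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum, Pereira 2001]

\[ s = s_1, s_2, \ldots, s_n \qquad o = o_1, o_2, \ldots, o_n \]

[Figure: one chain diagram each for the joint and the conditional model over S_{t-1}, S_t, S_{t+1} and O_{t-1}, O_t, O_{t+1}]

Joint:
\[ P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t) \]

Conditional:
\[ P(s \mid o) = \frac{1}{P(o)} \prod_{t=1}^{|o|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)
             = \frac{1}{Z(o)} \prod_{t=1}^{|o|} \Phi_s(s_t, s_{t-1})\, \Phi_o(o_t, s_t) \]

where
\[ \Phi_o(t) = \exp\Big( \sum_k \lambda_k f_k(s_t, o_t) \Big) \]

(A super-special case of Conditional Random Fields.)

The f_k can be arbitrary features of s, o, and t.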


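Feature Functions

Example \( f_k(s_t, s_{t-1}, o, t) \):

\[ f_{\langle \text{Capitalized}, s_i, s_j \rangle}(s_t, s_{t-1}, o, t) =
   \begin{cases} 1 & \text{if Capitalized}(o_t) \wedge s_{t-1} = s_i \wedge s_t = s_j \\
                 0 & \text{otherwise} \end{cases} \]

o = Yesterday Pedro Domingos spoke this example sentence.
    o1        o2    o3       o4    o5   o6      o7

[Figure: states s1, s2, s3, s4 drawn over the sentence]

\[ f_{\langle \text{Capitalized}, s_1, s_3 \rangle}(s_2, s_1, o, 2) = 1 \]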
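A feature function of this kind is just an indicator over the current state, the previous state, the whole observation sequence, and a position. The sketch below is one possible reading: the factory name, the argument order `(s_t, s_prev, o, t)`, the state labels, and the example state path are all assumptions made for illustration.

```python
def make_capitalized_feature(s_i, s_j):
    """Build f_<Capitalized, s_i, s_j>(s_t, s_prev, o, t).

    Returns 1 when the word at position t is capitalized, the previous
    state is s_i, and the current state is s_j; otherwise 0.
    """
    def f(s_t, s_prev, o, t):
        word = o[t - 1]  # positions are 1-based, matching o1 ... o7
        return int(word[:1].isupper() and s_prev == s_i and s_t == s_j)
    return f


def feature_count(f, states, o):
    """Sum the feature over positions t = 2 .. |o| along one state path."""
    return sum(f(states[t - 1], states[t - 2], o, t) for t in range(2, len(o) + 1))


o = "Yesterday Pedro Domingos spoke this example sentence".split()
f = make_capitalized_feature("s1", "s3")

print(f("s3", "s1", o, 2))  # "Pedro" is capitalized, s1 -> s3

# A hypothetical state path over the seven words.
path = ["s1", "s3", "s3", "s2", "s2", "s2", "s2"]
print(feature_count(f, path, o))
```

Because the function sees the whole sequence `o`, it could just as easily test neighboring words or formatting, which is what "arbitrary, overlapping features" buys over an HMM's atomic emissions.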


Learning Parameters of CRFs

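Maximize log-likelihood of parameters \( \Lambda = \{\lambda_k\} \) given training data D:

\[ L = \sum_{\langle s, o \rangle \in D} \log \left( \frac{1}{Z(o)} \prod_{t=1}^{|o|} \exp\Big( \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \Big) \right) - \sum_k \frac{\lambda_k^2}{2\sigma^2} \]

Log-likelihood gradient:

\[ \frac{\partial L}{\partial \lambda_k} = \sum_{\langle s, o \rangle \in D} \#_k(s, o) - \sum_i \sum_{s'} P_\Lambda(s' \mid o^{(i)})\, \#_k(s', o^{(i)}) - \frac{\lambda_k}{\sigma^2} \]

where
\[ \#_k(s, o) = \sum_t f_k(s_t, s_{t-1}, o, t) \]

Methods:
• iterative scaling (quite slow – 2000 iterations from good start)
• gradient, conjugate gradient (faster)
• limited-memory quasi-Newton methods (“super fast”)

[Sha & Pereira 2002] & [Malouf 2002]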


Voted Perceptron Sequence Models

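[Collins 2001; also Hofmann 2003, Taskar et al 2003]

Given training data: \( D = \{ \langle o, s \rangle^{(i)} \} \)

Initialize parameters to zero: \( \forall k: \lambda_k = 0 \)

Iterate to convergence:
  for all training instances, i:
    \[ s_{\text{Viterbi}} = \arg\max_{s} \exp\Big( \sum_t \sum_k \lambda_k f_k(s_t, s_{t-1}, o^{(i)}, t) \Big) \]
    \[ \forall k: \lambda_k = \lambda_k + C_k(s^{(i)}, o^{(i)}) - C_k(s_{\text{Viterbi}}, o^{(i)}) \]

where \( C_k(s, o) = \sum_t f_k(s_t, s_{t-1}, o, t) \) as before

The update \( C_k(s^{(i)}, o^{(i)}) - C_k(s_{\text{Viterbi}}, o^{(i)}) \) is analogous to the gradient for this one training instance.

Avoids the tricky math; very fast; uses “pseudo-negative” examples of sequences; approximates a margin classifier for “good” vs “bad” sequences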
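The training loop can be sketched compactly. This is a minimal version under stated assumptions: it applies the plain perceptron update without the voting or averaging step, the three feature templates in `features` are hypothetical, and the two training sentences are toy data.

```python
from collections import defaultdict


def features(prev_tag, tag, words, t):
    """Hypothetical indicator features f_k(s_t, s_{t-1}, o, t), by name."""
    word = words[t]
    return [
        f"trans:{prev_tag}>{tag}",
        f"word:{word.lower()}|{tag}",
        f"cap:{word[:1].isupper()}|{tag}",
    ]


def counts(tags, words):
    """C_k(s, o): how often each feature fires along a tag sequence."""
    c = defaultdict(int)
    prev = "<s>"
    for t, tag in enumerate(tags):
        for k in features(prev, tag, words, t):
            c[k] += 1
        prev = tag
    return c


def viterbi(words, tagset, weights):
    """argmax_s of sum_t sum_k lambda_k f_k(s_t, s_{t-1}, o, t)."""
    def score(prev, tag, t):
        return sum(weights.get(k, 0.0) for k in features(prev, tag, words, t))

    best = [{tag: (score("<s>", tag, 0), None) for tag in tagset}]
    for t in range(1, len(words)):
        col = {}
        for tag in tagset:
            prev = max(tagset, key=lambda p: best[t - 1][p][0] + score(p, tag, t))
            col[tag] = (best[t - 1][prev][0] + score(prev, tag, t), prev)
        best.append(col)
    tag = max(tagset, key=lambda s: best[-1][s][0])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]


def train(data, tagset, epochs=10):
    weights = defaultdict(float)  # every lambda_k starts at zero
    for _ in range(epochs):
        mistakes = 0
        for words, gold in data:
            guess = viterbi(words, tagset, weights)
            if guess != gold:
                mistakes += 1
                for k, v in counts(gold, words).items():
                    weights[k] += v  # + C_k(s_gold, o)
                for k, v in counts(guess, words).items():
                    weights[k] -= v  # - C_k(s_viterbi, o)
        if mistakes == 0:  # converged on the training data
            break
    return weights


tagset = ["O", "PER"]
data = [
    ("Pedro Domingos spoke".split(), ["PER", "PER", "O"]),
    ("Yesterday Pedro spoke".split(), ["O", "PER", "O"]),
]
weights = train(data, tagset)
for words, gold in data:
    print(words, viterbi(words, tagset, weights))
```

The Viterbi output plays the role of the "pseudo-negative" example: whenever it differs from the gold sequence, its features are penalized and the gold sequence's features are rewarded.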


Broader Issues in IE


Broader View

[Figure: IE pipeline with these components: Create ontology; Spider; Filter by relevance; Document collection; IE (Segment, Classify, Associate, Cluster); Load DB; Database; Query, Search; Data mine; Label training data; Train extraction models]

Up to now we have been focused on segmentation and classification


Broader View

[Figure: the same IE pipeline, with Tokenize added to the IE steps and five numbered callouts (1 to 5)]

Now touch on some other issues


(1) Association as Binary Classification

[Zelenko et al, 2002]

Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair.
  (Christos Faloutsos: Person; Ted Senator: Person; KDD 2003 General Chair: Role)

Person-Role (Christos Faloutsos, KDD 2003 General Chair): NO

Person-Role (Ted Senator, KDD 2003 General Chair): YES

Do this with SVMs and tree kernels over parse trees.


(1) Association with Finite State Machines

[Ray & Craven, 2001]

… This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol. …

DET this, N enzyme, N ubc6, V localizes, PREP to, ART the, ADJ endoplasmic, N reticulum, PREP with, ART the, ADJ catalytic, N domain, V facing, ART the, N cytosol

Subcellular-localization (UBC6, endoplasmic reticulum)


(1) Association with Graphical Models [Roth & Yih 2002]

Capture arbitrary-distance dependencies among predictions.
  Local language models contribute evidence to entity classification.
  Local language models contribute evidence to relation classification.
  Random variable over the class of entity #2, e.g. over {person, location, …}
  Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}
  Dependencies between classes of entities and relations!

Inference with loopy belief propagation.

Page 61

(1) Association with Graphical Models

[Roth & Yih 2002]

Also capture long-distance dependencies among predictions.

Local language models contribute evidence to entity classification.

Random variable over the class of entity #1, e.g. over {person, location, …}

Local language models contribute evidence to relation classification.

Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}

Dependencies between classes of entities and relations!

Inference with loopy belief propagation.

[Node labels in the figure: "person?", "person", "lives-in"]

Page 62

(1) Association with Graphical Models

[Roth & Yih 2002]

Also capture long-distance dependencies among predictions.

Local language models contribute evidence to entity classification.

Random variable over the class of entity #1, e.g. over {person, location, …}

Local language models contribute evidence to relation classification.

Random variable over the class of relation between entity #2 and #1, e.g. over {lives-in, is-boss-of, …}

Dependencies between classes of entities and relations!

Inference with loopy belief propagation.

[Node labels in the figure: "location", "person", "lives-in"]
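The effect of the entity/relation dependencies can be illustrated on a toy model: one entity that looks weakly like a person flips to location once the lives-in relation is taken into account. This is only a sketch. All probabilities, the `SIGNATURE` table, and the penalty value are invented, and the joint assignment is found by brute-force enumeration rather than the loopy belief propagation used in the cited work.

```python
from itertools import product

ENTITY_CLASSES = ["person", "location"]
RELATION_CLASSES = ["lives-in", "is-boss-of"]

# Invented outputs of the local classifiers (evidence from local language models).
local_e1 = {"person": 0.9, "location": 0.1}
local_e2 = {"person": 0.55, "location": 0.45}   # the uncertain "person?" node
local_rel = {"lives-in": 0.8, "is-boss-of": 0.2}

# Assumed argument types for each relation: relation(entity #1, entity #2).
SIGNATURE = {
    "lives-in": ("person", "location"),
    "is-boss-of": ("person", "person"),
}


def compatibility(e1, e2, rel, penalty=0.01):
    """Soft constraint: strongly down-weight type-inconsistent assignments."""
    return 1.0 if SIGNATURE[rel] == (e1, e2) else penalty


def joint_score(e1, e2, rel):
    """Unnormalized score: product of local evidence and the compatibility factor."""
    return local_e1[e1] * local_e2[e2] * local_rel[rel] * compatibility(e1, e2, rel)


# Each variable decided on its own local evidence.
independent = (
    max(local_e1, key=local_e1.get),
    max(local_e2, key=local_e2.get),
    max(local_rel, key=local_rel.get),
)

# All variables decided together (exact enumeration; feasible only for tiny models).
joint = max(
    product(ENTITY_CLASSES, ENTITY_CLASSES, RELATION_CLASSES),
    key=lambda assignment: joint_score(*assignment),
)

print("independent:", independent)
print("joint:      ", joint)
```

Enumeration grows exponentially with the number of entities and relations, which is why approximate inference is used on real sentences.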

Page 63

Broader View

[Diagram: Create ontology; Spider; Document collection; Filter by relevance; IE (Tokenize, Segment, Classify, Associate, Cluster); Label training data; Train extraction models; Load DB; Database; Query, Search; Data mine. Callouts numbered 1 to 5.]

Now touch on some other issues.

When do two extracted strings refer to the same object?

Page 64

(2) Learning a Distance Metric Between Records

[Borthwick, 2000; Cohen & Richman, 2001; Bilenko & Mooney, 2002, 2003]

Learn Pr({duplicate, not-duplicate} | record1, record2) with a Maximum Entropy classifier.

Do greedy agglomerative clustering using this probability as a distance metric.
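The clustering step can be sketched as follows, treating the pairwise duplicate probability as a similarity. Several things here are assumptions: `toy_prob_duplicate` is a token-overlap stand-in for the trained Maximum Entropy classifier, and the average-link merge rule and the 0.3 stopping threshold are illustrative choices, not taken from the cited papers.

```python
def toy_prob_duplicate(a, b):
    """Stand-in for Pr(duplicate | a, b): Jaccard overlap of lowercased tokens."""
    ta = set(a.lower().replace(".", "").split())
    tb = set(b.lower().replace(".", "").split())
    return len(ta & tb) / len(ta | tb)


def greedy_agglomerative(records, prob_duplicate, threshold=0.3):
    """Repeatedly merge the two most similar clusters until none reach the threshold."""
    clusters = [[r] for r in records]

    def link(a, b):
        # Average-link similarity between two clusters.
        scores = [prob_duplicate(x, y) for x in a for y in b]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        best_score, i, j = max(
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_score < threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


records = [
    "William Cohen",
    "William W. Cohen",
    "W. Cohen",
    "Andrew McCallum",
    "A. McCallum",
]
clusters = greedy_agglomerative(records, toy_prob_duplicate)
for cluster in clusters:
    print(cluster)
```

Swapping in a learned classifier only requires passing a different `prob_duplicate` function.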

Page 65

(2) String Edit Distance

distance(“William Cohen”, “Willliam Cohon”)

Alignment:

s     W I L L - I A M _ C O H E N
t     W I L L L I A M _ C O H O N
op    C C C C I C C C C C C C S C
cost  0 0 0 0 1 1 1 1 1 1 1 1 2 2

(op: C = copy, I = insert, S = substitute; "-" marks the gap in s; cost is the running total)

Page 66

(2) Computing String Edit Distance

$$
D(i,j) = \min
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{// subst/copy} \\
D(i-1,j) + 1 & \text{// insert} \\
D(i,j-1) + 1 & \text{// delete}
\end{cases}
$$

      C  O  H  E  N
  M   1  2  3  4  5
  C   1  2  3  4  5
  C   2  2  3  4  5
  O   3  2  3  4  5
  H   4  3  2  3  4
  N   5  4  3  3  3

A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (there may be more than one).

Learn these parameters (the costs of the edit operations).
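The recurrence can be sketched directly as a dynamic program. The function name `edit_distance_table` and the `subst_cost` / `indel_cost` parameters are illustrative; they default to the unit costs in the recurrence, and they are the quantities a learned variant would estimate from data.

```python
def edit_distance_table(s, t, subst_cost=1, indel_cost=1):
    """Return the full table D, where D[i][j] is the distance between s[:i] and t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    D = [[0] * cols for _ in range(rows)]

    # Boundary cases: turning a prefix into the empty string (or back).
    for i in range(1, rows):
        D[i][0] = i * indel_cost
    for j in range(1, cols):
        D[0][j] = j * indel_cost

    for i in range(1, rows):
        for j in range(1, cols):
            d = 0 if s[i - 1] == t[j - 1] else subst_cost  # copy is free
            D[i][j] = min(
                D[i - 1][j - 1] + d,       # subst/copy
                D[i - 1][j] + indel_cost,  # insert
                D[i][j - 1] + indel_cost,  # delete
            )
    return D


table = edit_distance_table("MCCOHN", "COHEN")
for row_char, row in zip("MCCOHN", table[1:]):
    print(row_char, *row[1:])

print(edit_distance_table("William Cohen", "Willliam Cohon")[-1][-1])
```

The first loop prints the rows of the table for "MCCOHN" against "COHEN"; the last line is the distance for the “William Cohen” pair from the alignment example.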

Page 67

(2) String Edit Distance Learning

Precision/recall for MAILING dataset duplicate detection

[Bilenko & Mooney, 2002, 2003]

Page 68

(2) Information Integration

Goal might be to merge results of two IE systems:

Records from the first system:

  Name: Introduction to Computer Science
  Number: CS 101
  Teacher: M. A. Kludge
  Time: 9-11am

  Name: Data Structures in Java
  Room: 5032 Wean Hall

Records from the second system:

  Title: Intro. to Comp. Sci.
  Num: 101
  Dept: Computer Science
  Teacher: Dr. Klüdge
  TA: John Smith

  Topic: Java Programming
  Start time: 9:10 AM

[Minton, Knoblock, et al 2001], [Doan, Domingos, Halevy 2001],[Richardson & Domingos 2003]

Page 69

(2) Other Information Integration Issues

Distance metrics for text – which work well? [Cohen, Ravikumar, Fienberg, 2003]

Finessing integration by soft database operations based on similarity [Cohen, 2000]

Integration of complex structured databases (capture dependencies among multiple merges) [Cohen, McAllester, Kautz, KDD 2000; Pasula, Marthi, Milch, Russell, Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003]

Page 70

Broader View

[Diagram: Create ontology; Spider; Document collection; Filter by relevance; IE (Tokenize, Segment, Classify, Associate, Cluster); Label training data; Train extraction models; Load DB; Database; Query, Search; Data mine. Callouts numbered 1 to 5.]

Now touch on some other issues.

Page 71

(5) Working with IE Data

Some special properties of IE data:
  It is based on extracted text
  It is “dirty” (missing or extraneous facts, improperly normalized entity names, etc.)
  May need cleaning before use

What operations can be done on dirty, unnormalized databases?
  Data-mine it directly.
  Query it directly with a language that has “soft joins” across similar, but not identical keys. [Cohen 1998]
  Use it to construct features for learners [Cohen 2000]
  Infer a “best” underlying clean database [Cohen, Kautz, McAllester, KDD 2000]
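A rough sketch of what a "soft join" does: rows are paired when their keys are similar enough, not only when they are identical, and the pairs come back ranked by similarity. The two tables, the `soft_join` helper, and the 0.5 threshold are invented for illustration, and token-level Jaccard overlap is a stand-in chosen for brevity, not the similarity measure of the cited query language.

```python
def tokens(text):
    """Lowercase word tokens, ignoring periods and commas."""
    return set(text.lower().replace(".", " ").replace(",", " ").split())


def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def soft_join(left, right, left_key, right_key, threshold=0.5):
    """Pair rows whose keys are similar; return (similarity, left_row, right_row) triples."""
    matches = []
    for l_row in left:
        for r_row in right:
            sim = jaccard(l_row[left_key], r_row[right_key])
            if sim >= threshold:
                matches.append((sim, l_row, r_row))
    # Most similar pairs first, as a ranked answer.
    return sorted(matches, key=lambda m: -m[0])


# Hypothetical extracted tables with unnormalized person names.
people = [
    {"name": "William W. Cohen", "topic": "soft joins"},
    {"name": "Andrew McCallum", "topic": "CRFs"},
]
papers = [
    {"author": "Cohen, William", "venue": "KDD"},
    {"author": "McCallum, Andrew", "venue": "ICML"},
]

matches = soft_join(people, papers, "name", "author")
for sim, person, paper in matches:
    print(f"{sim:.2f}  {person['name']!r} ~ {paper['author']!r}")
```

An exact join on these keys would return nothing, since no two name strings are identical.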

Page 72

Evaluating IE Accuracy

Always evaluate performance on independent, manually-annotated test data not used during system development.

Template measure for each test document:
  Total number of correct extractions in the solution template: N
  Total number of slot/value pairs extracted by the system: E
  Number of extracted slot/value pairs that are correct (i.e. in the solution template): C

Compute average value of metrics adapted from IR:
  Recall = C/N
  Precision = C/E
  F-Measure = harmonic mean of recall and precision = 2PR / (P + R)
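The three metrics can be computed per document from sets of slot/value pairs, as in this small sketch. The `score_template` name and the example slot values are invented; N, E, and C follow the definitions above.

```python
def score_template(solution, extracted):
    """Return (precision, recall, f_measure) for one test document."""
    solution, extracted = set(solution), set(extracted)
    N = len(solution)              # correct extractions in the solution template
    E = len(extracted)             # slot/value pairs the system produced
    C = len(solution & extracted)  # produced pairs that are correct

    recall = C / N if N else 0.0
    precision = C / E if E else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical solution template and system output for one document.
solution = {
    ("Perpetrator", "group A"),
    ("Target", "embassy"),
    ("Location", "city B"),
    ("Date", "March 3"),
}
extracted = {
    ("Perpetrator", "group A"),
    ("Target", "embassy"),
    ("Location", "city C"),  # wrong value, so it counts toward E but not C
}

precision, recall, f_measure = score_template(solution, extracted)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

Averaging these values over all test documents gives the figures usually reported.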

Page 73

MUC Information Extraction: State of the Art c. 1997

NE – named entity recognition
CO – coreference resolution
TE – template element construction
TR – template relation construction
ST – scenario template production

Page 74

Summary and prelude

We’ve looked at the “fragment extraction” task. Future?
  Top-down semantic constraints (as well as syntax)?
  Unified framework for extraction from regular & natural text? (BWI is one tiny step; Webfoot [Soderland 1999] is another.)

Beyond fragment extraction:
  Anaphora resolution, discourse processing, ...

Fragment extraction is good enough for many Web information services!

Next time:
  Learning methods for information extraction

Page 75

Three generations of IE systems

Hand-Built Systems – Knowledge Engineering [1980s– ]
  Rules written by hand
  Require experts who understand both the systems and the domain
  Iterative guess-test-tweak-repeat cycle

Automatic, Trainable Rule-Extraction Systems [1990s– ]
  Rules discovered automatically using predefined templates, using methods like ILP
  Require huge, labeled corpora (effort is just moved!)

Machine Learning (Sequence) Models [1997 – ]
  One decodes a statistical model that classifies the words of the text, using HMMs, random fields or statistical parsers
  Learning usually supervised; may be partially unsupervised

Page 76

Basic IE References

Douglas E. Appelt and David Israel. 1999. Introduction to Information Extraction Technology. IJCAI 1999 Tutorial. http://www.ai.sri.com/~appelt/ie-tutorial/

Kushmerick, Weld, Doorenbos: Wrapper Induction for Information Extraction. IJCAI 1997. http://www.cs.ucd.ie/staff/nick/

Stephen Soderland: Learning Information Extraction Rules for Semi-Structured and Free Text. Machine Learning 34(1-3): 233-272 (1999)

Page 77

Some Available IE Tools

MALLET (UMass): statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text.

Sample Application:

GeneTaggerCRF: a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit). It uses conditional random fields to find genes in a text file.

Page 78

MinorThird

http://minorthird.sourceforge.net/

“a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text”

Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)

Page 79

GATE

http://gate.ac.uk/ie/annie.html

Leading toolkit for text mining
Distributed with an information extraction component set called ANNIE (demo)
Used in many research projects (a long list can be found on its website)
Integration with IBM UIMA under way

Page 80

Sunita Sarawagi's CRF package

http://crf.sourceforge.net/

A Java implementation of conditional random fields for sequential labeling.

Page 81

UIMA (IBM)

Unstructured Information Management Architecture.

A platform for unstructured information management solutions from combinations of semantic analysis (IE) and search components.

Page 82

Some Interesting Websites Based on IE

ZoomInfo
CiteSeer.org (some of us use it every day!)
Google Local, Google Scholar
and many more…