Srihari-CSE635-Fall 2002
CSE 635 Multimedia Information Retrieval
Information Extraction
Overview
Introduction to IE
Named Entity tagger HMM approach
Relationship/Event detection
Text Mining intelligence applications
Information Extraction
What is IE?
• The identification of instances of a particular class of events or relationships in a natural language text, and the extraction of the relevant arguments of the event or relationship. (MUC, de facto)
• Information Extraction involves the creation of a structured representation (such as a database) of selected information drawn from the text. (Grishman 1997)
• Identification of key entities, relationships between them, and significant activity involving these entities. (Srihari)
Goals of IE
• Transform unstructured text into structured/semi-structured text
• Automatic template-filling
• Automatically populate databases
• Facilitate information discovery: sometimes, what you don't know is most important; if you know what you are looking for, use a search engine! IE permits information discovery.
Information to Intelligence
Unstructured Data → Information → Intelligence
Unstructured data (example headlines):
• INTC drops X%
• Microsoft, Lockheed eye federal deals
• C-bridge, eXcelon to merge
• RF Micro Devices Introduces Cellular CDMA LNA and PA Driver Amplifier with Bypass Switch
• Transmeta Scores Latest Crusoe Win with Sharp
• Ronald Brumback Named Pres. & COO of Top Layer Networks
• Top INTC executive, John Doe, leaves to join Transmeta as VP Engineering
• FedEx to Cut 130 Jobs in Texas
Information: entities (People, Company, Product), relationships, events
Intelligence: text mining, analytics
• What's new from RFMD?
• What caused INTC shares to drop?
Levels of Information Extraction
MUC identifies the following levels of extraction:
• Named Entity Tagging
    Bill Gates is the chairman of Microsoft
• Relationship Detection: leads to entity profiles
    chairman-of(Bill Gates, Microsoft)
• Event Detection
    executive change: person_in, person_out, company_involved, date
• Scenario Extraction
    Bombing incident: where, # of casualties, reason
    follow-up events involved: ordered sequentially
Named Entity Tagging
Bridgestone Sports Co. said Friday it has set up a joint venture in Hong Kong with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.
The joint venture, Bridgestone Sports Hong Kong Co., capitalized at 20 million Hong Kong dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month. The monthly output will be later raised to 50,000 units, Bridgestone Sports spokesman Tom White said.
The new company, based in Kaohsiung, southern Hong Kong, is owned 75 pct by Bridgestone Sports, 15 pct by Union Precision Casting Co. of Hong Kong and the remainder by Taga Co., a company active in trading with Hong Kong, the officials said.
Output of Named Entity Tagger
<company> Bridgestone Sports Co. </company> said <date> Friday </date> it has set up a joint venture in <city> Hong Kong </city> with a local concern and a <ethnic> Japanese </ethnic> trading house to produce golf clubs to be shipped to <country> Japan </country>.
The joint venture, <company> Bridgestone Sports Hong Kong Co. </company>, capitalized at <money> 20 million Hong Kong dollars </money>, will start production in <date> January 1990 </date> with production of 20,000 iron and "metal wood" clubs a month. The monthly output will be later raised to 50,000 units, <company> Bridgestone Sports </company> spokesman <man> Tom White </man> said.
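Inline annotations of this kind are easy to turn into a list of entities. The following is a minimal sketch, assuming the tagger emits well-formed, non-nested `<tag> ... </tag>` pairs as in the output above; the function name `extract_entities` is hypothetical and not part of any tagger described in these slides.

```python
import re

# Matches one inline annotation such as "<city>Hong Kong </city>".
# The backreference \1 requires the closing tag to match the opening tag.
TAG_RE = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>", re.DOTALL)

def extract_entities(tagged_text):
    """Return (tag, surface string) pairs in document order."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(tagged_text)]

sample = (
    "<company> Bridgestone Sports Co. </company> said <date> Friday </date> "
    "it has set up a joint venture in <city>Hong Kong </city>"
)
entities = extract_entities(sample)
for tag, text in entities:
    print(f"{tag:8s} {text}")
```

Because the closing tag must match the opening tag, a mistyped closing tag makes the annotation silently disappear from the result, so a real pipeline would likely validate the markup first.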
Named-Entity Definition
• A named entity is a word or phrase that denotes a proper name, such as a person, organization, location, or product, or a temporal or numerical expression.
• Name classes are associated with individual words.
• A named entity is associated with a contiguous word sequence with the same name class.
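The last two points can be sketched in a few lines: given one name class per word, contiguous runs with the same class are merged into entities. This is only an illustration; `group_entities` is a hypothetical helper, and the label `NN` (not a name) follows the name-class codes used later in these slides.

```python
from itertools import groupby

def group_entities(words, name_classes, not_name="NN"):
    """Merge contiguous words that share a name class into (class, phrase) pairs."""
    entities = []
    pairs = zip(words, name_classes)
    for name_class, run in groupby(pairs, key=lambda pair: pair[1]):
        phrase = " ".join(word for word, _ in run)
        if name_class != not_name:  # skip words that are not part of a name
            entities.append((name_class, phrase))
    return entities

words = "it has set up a joint venture in Hong Kong".split()
tags = ["NN"] * 8 + ["LO", "LO"]
print(group_entities(words, tags))
```

One limitation of this simple grouping: two different entities of the same class that happen to be adjacent would be merged into one.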
Entity Profiles
<Person Profile id=1>:
name: Waleed Alshehri
aliases: Waleed
position: a Saudi commercial pilot
age: mid-20s
gender: MALE
education: Embry-Riddle Aeronautical University; FlightSafety Academy
associations: Satam Al Suqami; Wail Alshehri; Homing Inn; American Flight 11
events-involved: <graduated>; <hijacking>; <suicide attack>
descriptors: quiet and private; Middle Eastern backgrounds; another of the eventual hijackers
Event Detection
Event: <MOVEMENT>
who: 23 foreign fighters
whereto: into Pakistan
Location: Pakistan, Afghanistan
When: normalStr=020622 Monday
Snippet: Pakistan said Monday its troops arrested 23 foreign fighters trying to cross from Afghanistan into Pakistan over the weekend.
Event: <CONTRACT>
Money_involved: £5.9 million ($8.9 million)
Who: CVF Team, Thomson-CSF, Lockheed Martin, Raytheon, BMT Defense Services, Defense Procurement Agency
When: normalStr=021100 last November
Snippet: The BAE Systems-led CVF Team and a rival Thomson-CSF group, including Lockheed Martin, Raytheon and BMT Defense Services, were awarded parallel £5.9 million ($8.9 million) contracts by the Defense Procurement Agency last November to undertake first-stage assessment phase work for CVF.
3 Major Approaches to IE
Layout-based
• wrapper induction
• application focused: e.g. jobs database, processing resumes, etc.
IR-based "concept" extraction
• uses techniques such as pattern matching, proximity, co-occurrence
• often seen in Knowledge Management applications (e.g. hardware)
NLP-based
• statistical techniques (POS tagging, NE tagging)
• grammatical techniques
• more sophisticated levels of IE possible
Convergence of NLP-driven and IR-driven Approaches to IE
Information Extraction
Layout-based
IR-based
NLP-based
Entities
Relationships
Events
* Grammars
* Statistical Language Models
Tag key phrases in context
Associate key phrases with entities
* Lexical Lookups
* Word Co-occurrence
* Heuristics
Concept Tagging
Domain-specific Event Detection
* Expert Lexicons
* Lexicon Grammars
Generic
Domain-Specific
Focus on Precision
Focus on Recall
Challenges in IE
• Normalization
    temporal references (today, last year, during the Olympics …)
    spatial references (Buffalo)
• Alias resolution
    George Bush, President Bush
    IBM, "the company"
• Verb concepts
    kill, murder, assassinate, etc.
• Diversity of sources
    web documents, e-mail, PowerPoint, speech/OCR transcripts
    sophisticated pre-processing required
• Cross-document information consolidation
• Rapid domain porting
• Intuitive user interface
    should support decision making: workflow, visualization, etc.
Homeland Defense: Track Key Entities Based on Watch Lists
Information Discovery Portal
Track...: Organizations, People, Targets (tracked entity: al-Qaeda)
Reports: Overall Coverage of al-Qaeda Over Time [chart: # Reports, axis from 0 to 50]
Overall Coverage: Events, Info. Sources, Documents
Associations: Who/what is being associated with al-Qaeda?
• Organizations: Religious, Political, Terrorist
    al-Jihad (34), HAMAS (16), Hizballah (5), …more
• People
• Incidents
    Attacks (125), Bombing (64), Threats (45), …more
• Locations
• Weapons
• Governments
Alerts for Week of August 6, 2001
• (3) new reports of al-Qaeda terrorist activity
• (1) new report of bin Laden sighting
• (4) new quotes by bin Laden
• (1) new target identified
Discover Other Related Information
Name-Class Definition
OR: organization
    CO: company "Bridgestone Sports Co.", "Bridgestone Sports Hong Kong Co.", "Bridgestone Sports"
LO: location
    CI: city "Hong Kong"
    CT: country "Japan"
PE: person
    MAN: man "Tom White"
TI: time
    DA: date "Friday"
NN: not name
    "said", "it has set up a joint venture", "with a local concern and a", "trading house to produce golf"
Name-Class Tree
There are 6 top-level name-classes, and 35 sub-type name-classes.
Time -- Hour, Part Day, Duration, Frequency, Age, Day, Month, Season, Year, Decade, Century
Location -- City, Province, Country, Continent, Ocean, Lake, River, Mountain, Road, Region, District, Airport
Organization -- Company, Government, Association, School, Army, Mass Media
Person -- Man, Woman
Product -- Vehicle, Software
Event -- Conference, Exhibition
Application of Named Entity Tagging
• Question-Answering System
Q: Where did Bridgestone Sports Co. set up a joint venture?
A: Hong Kong
Q: When did Bridgestone Sports Hong Kong Co. start production?
A: January 1990
Q: Who is the spokesman for Bridgestone Sports?
A: Tom White
Question Asking Points and Named Entities
Where → Location
Q: Where did Bridgestone Sports Co. set up a joint venture?
A: Hong Kong
When → Time
Q: When did Bridgestone Sports Hong Kong Co. start production?
A: January 1990
Who → Person
Q: Who is the spokesman for Bridgestone Sports?
A: Tom White
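The mapping from asking point to name class can be sketched as a lookup that filters tagged entities down to candidate answers. This is a simplified illustration, not the QA system referred to in the slides: the names `ASKING_POINT_TO_CLASS`, `expected_name_class` and `candidate_answers` are hypothetical, the asking point is assumed to be the first word of the question, and ranking the candidates is left out.

```python
# Asking point -> top-level name class (codes as in the Name-Class Definition slide).
ASKING_POINT_TO_CLASS = {"where": "LO", "when": "TI", "who": "PE"}

def expected_name_class(question):
    """Guess the name class of the answer from the first word of the question."""
    words = question.strip().split()
    if not words:
        return None
    return ASKING_POINT_TO_CLASS.get(words[0].lower())

def candidate_answers(question, entities):
    """Keep only the tagged entities whose class matches the asking point."""
    wanted = expected_name_class(question)
    return [text for name_class, text in entities if name_class == wanted]

# Entities as (name class, text) pairs, e.g. produced by a named entity tagger.
entities = [
    ("OR", "Bridgestone Sports Hong Kong Co."),
    ("LO", "Hong Kong"),
    ("TI", "January 1990"),
    ("PE", "Tom White"),
]
print(candidate_answers("Who is the spokesman for Bridgestone Sports?", entities))
```

With several entities of the matching class in a passage, a further step (for example proximity to the question keywords) would be needed to choose among the candidates.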
Application of Named Entity Tagging (contd.)
Support other Information Extraction tasks
Extract Correlated Entities (relationship):
entity 1: Tom White man
relation: employed by
entity 2: Bridgestone Sports company
Extract events:
predicate: start
argument 1: Bridgestone Sports Hong Kong Co company
argument 2: production
time: January 1990 date
Other Applications of NE
Search engines
text categorization/filtering
data mining
Statistical Model for Named Entity Tagging
Given a sequence of words (W), our goal is to find the sequence of name-class (NC) with maximum Pr(NC|W).
For example:
word sequence :
it has set up a joint venture in Hong Kong
Possible name-class sequence
it has set up a joint venture in Hong Kong
NN NN NN NN NN NN NN NN LO LO
LO NN NN NN NN NN NN NN OR LO
$$\arg\max_{\mathrm{NC\ Sequence}} \Pr(\mathrm{NC\ Sequence} \mid \mathrm{W\ Sequence})$$
Statistical Model for Named Entity Tagging (contd.)
• Construct a manually tagged training corpus.
• Extract the necessary statistics from the corpus to build a statistical model that can automatically compute Pr(NC Sequence | W Sequence) for unseen data.
• Search for the NC sequence that maximizes the probability Pr(NC Sequence | W Sequence).
Corpus → Statistical Model → tagging of unseen data
Statistical Model for Named Entity Tagging (contd.)
• The training corpus is large enough to provide fairly good unigram and bigram information.
    unigram example: Pr(Organization | "US")
    bigram example: Pr(Organization | "US", "the")
• The training corpus is too small to support any direct estimation beyond bigrams.
• Question: how can Pr(NC Sequence | Sentence) be evaluated from the above unigram and bigram information?
• One solution: transform the conditional probability into the (NC, Sentence) joint probability (Bayes' rule), then decouple the sentence into bigram sequences (Markov assumption).
Bayes’ Rule
Using Bayes’ rule, we have
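$$
\begin{aligned}
\arg\max_{\mathrm{NC\ Sequence}} \Pr(\mathrm{NC\ Sequence} \mid \mathrm{W\ Sequence})
&= \arg\max_{\mathrm{NC\ Sequence}} \frac{\Pr(\mathrm{W\ Sequence}, \mathrm{NC\ Sequence})}{\Pr(\mathrm{W\ Sequence})} \\
&= \arg\max_{\mathrm{NC\ Sequence}} \Pr(\mathrm{W\ Sequence}, \mathrm{NC\ Sequence})
\end{aligned}
$$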
Markov Assumption
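$$
\begin{aligned}
\Pr(\mathrm{NC\ Sequence}, \mathrm{W\ Sequence})
&= \Pr(w_0, nc_0, \ldots, w_{n-1}, nc_{n-1}, w_n, nc_n) \\
&= \Pr(w_0, nc_0)\, \Pr(w_1, nc_1 \mid w_0, nc_0)\, \Pr(w_2, nc_2 \mid w_1, nc_1, w_0, nc_0) \cdots \Pr(w_n, nc_n \mid w_{n-1}, nc_{n-1}, \ldots, w_0, nc_0)
\end{aligned}
$$
By Markov assumption, we have
$$
\begin{aligned}
\Pr(w_2, nc_2 \mid w_1, nc_1, w_0, nc_0) &= \Pr(w_2, nc_2 \mid w_1, nc_1) \\
&\;\;\vdots \\
\Pr(w_n, nc_n \mid w_{n-1}, nc_{n-1}, \ldots, w_0, nc_0) &= \Pr(w_n, nc_n \mid w_{n-1}, nc_{n-1})
\end{aligned}
$$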
Markov Assumption (contd.)
So the final formula is
$$
\Pr(\mathrm{NC\ Sequence}, \mathrm{W\ Sequence})
= \Pr(w_0, nc_0)\, \Pr(w_1, nc_1 \mid w_0, nc_0)\, \Pr(w_2, nc_2 \mid w_1, nc_1) \cdots \Pr(w_n, nc_n \mid w_{n-1}, nc_{n-1})
$$
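As an illustration of the final formula, the joint probability of a word sequence and a name-class sequence is a product of one starting term and one bigram term per adjacent pair. The sketch below sums logarithms to avoid underflow; the function name `joint_log_prob`, the dictionary layout and the toy probabilities are all assumptions made for this example, not values from the course.

```python
import math

def joint_log_prob(words, name_classes, start_prob, bigram_prob):
    """log Pr(NC Sequence, W Sequence) under the bigram (Markov) factorization.

    start_prob[(w0, nc0)]                      -> Pr(w0, nc0)
    bigram_prob[((w_prev, nc_prev), (w, nc))]  -> Pr(w, nc | w_prev, nc_prev)
    """
    pairs = list(zip(words, name_classes))
    log_p = math.log(start_prob[pairs[0]])
    for prev, cur in zip(pairs, pairs[1:]):
        log_p += math.log(bigram_prob[(prev, cur)])
    return log_p

# Toy, made-up probabilities for a two-word sequence.
start_prob = {("Hong", "LO"): 0.5}
bigram_prob = {(("Hong", "LO"), ("Kong", "LO")): 0.8}

log_p = joint_log_prob(["Hong", "Kong"], ["LO", "LO"], start_prob, bigram_prob)
print(math.exp(log_p))  # product of the two factors, 0.5 * 0.8
```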
Hidden Markov Model
Define Hidden Markov Model as follows:
1. An output alphabet Ή = {0, 1, …, V-1};
2. A state space ф = {1, 2, …, c};
3. A transition probability distribution between states and associated output symbols p(symbol_n, state_n | symbol_{n-1}, state_{n-1}).
In the case of named entity tagging, regard the words as the output symbols and the tags as the states. The above statistical NE model is a Hidden Markov Model.
W1 W2 W3 W4 …..
<SS> PE PE PE PE
LO LO LO LO
OR OR OR OR
Statistics Estimation
The generation of words and name-class proceeds in three steps:
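$$
\Pr(w_n, nc_n \mid w_{n-1}, nc_{n-1}) =
\begin{cases}
\Pr(nc_n \mid w_{n-1}, nc_{n-1})\, \Pr(w_n \mid nc_n, nc_{n-1}) & \text{if } nc_n \neq nc_{n-1} \\
\Pr(w_n \mid w_{n-1}, nc_{n-1}) & \text{if } nc_n = nc_{n-1}
\end{cases}
$$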
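The maximum likelihood estimates (MLE) of the above probabilities are as follows, where C(·) is the number of times the event occurs in the training corpus:
$$
\begin{aligned}
\Pr(nc_n \mid w_{n-1}, nc_{n-1}) &= \frac{C(nc_n, w_{n-1}, nc_{n-1})}{C(w_{n-1}, nc_{n-1})} \\
\Pr(w_n \mid nc_n, nc_{n-1}) &= \frac{C(w_n, nc_n, nc_{n-1})}{C(nc_n, nc_{n-1})} \\
\Pr(w_n \mid w_{n-1}, nc_{n-1}) &= \frac{C(w_n, w_{n-1}, nc_{n-1})}{C(w_{n-1}, nc_{n-1})}
\end{aligned}
$$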
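A minimal sketch of how these count ratios could be collected from a tagged corpus follows. It makes several assumptions that are not stated in the slides: each sentence is a list of (word, name-class) pairs, a hypothetical start pair `("<s>", "<SS>")` precedes the first word, the word-to-word count is taken only when the name class stays the same, and there is no smoothing, so unseen events get probability 0.

```python
from collections import Counter

START = ("<s>", "<SS>")  # assumed sentence-start (word, name-class) pair

def count_events(corpus):
    """Collect the counts C(.) needed by the three MLE formulas."""
    counts = {name: Counter() for name in ("ctx", "nc", "nc_pair", "first", "next")}
    for sentence in corpus:
        prev_w, prev_nc = START
        for w, nc in sentence:
            counts["ctx"][(prev_w, prev_nc)] += 1          # C(w_{n-1}, nc_{n-1})
            counts["nc"][(nc, prev_w, prev_nc)] += 1       # C(nc_n, w_{n-1}, nc_{n-1})
            if nc != prev_nc:
                counts["nc_pair"][(nc, prev_nc)] += 1      # C(nc_n, nc_{n-1})
                counts["first"][(w, nc, prev_nc)] += 1     # C(w_n, nc_n, nc_{n-1})
            else:
                counts["next"][(w, prev_w, prev_nc)] += 1  # C(w_n, w_{n-1}, nc_{n-1})
            prev_w, prev_nc = w, nc
    return counts

def ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0

def p_name_class(counts, nc, prev_w, prev_nc):
    return ratio(counts["nc"][(nc, prev_w, prev_nc)], counts["ctx"][(prev_w, prev_nc)])

def p_first_word(counts, w, nc, prev_nc):
    return ratio(counts["first"][(w, nc, prev_nc)], counts["nc_pair"][(nc, prev_nc)])

def p_next_word(counts, w, prev_w, prev_nc):
    return ratio(counts["next"][(w, prev_w, prev_nc)], counts["ctx"][(prev_w, prev_nc)])

# Toy two-sentence corpus, invented for this example.
corpus = [
    [("in", "NN"), ("Hong", "LO"), ("Kong", "LO")],
    [("in", "NN"), ("Japan", "LO")],
]
counts = count_events(corpus)
print(p_name_class(counts, "LO", "in", "NN"))  # Pr(LO | "in", NN)
print(p_first_word(counts, "Hong", "LO", "NN"))  # Pr("Hong" | LO, NN)
```

In practice the zero probabilities for unseen events would make whole sequences impossible, so some form of smoothing or back-off is normally added on top of the raw ratios.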
Easy and Difficult Cases
Some cases are easy:
• Matsushita Electric Industrial Co. has reached agreement …
• Victor Co. of Japan (JVC) and Sony Corp. ...
Some cases are particularly difficult:
• In a factory of Blaupunkt Werke, a Robert Bosch subsidiary, …
• Touch Panel Systems, capitalized at 50 million Yen is owned ...
Machine learning vs. handcrafted rules
Handcrafted finite state patterns can be very effective:
    <proper-noun>+ <corporate designator> --> <corporation>
    e.g. Sony Corp.
Problems with handcrafted approach
• each new source requires tweaking, i.e. domain porting can be tedious
• speech recognition transcript, OCR require modification of rules
• rules for different languages are radically different
Machine learning approach more scalable
• exception: numerical expressions, other patterns which are very regular, e.g. contact information (telephone numbers, URLs, postal addresses, etc.)
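The corporation pattern above can be approximated with a regular expression. This is a rough sketch under loud assumptions: a proper noun is approximated by any capitalized word, and the designator list is a small illustrative sample, not a real lexicon. The names `DESIGNATORS` and `find_corporations` are hypothetical.

```python
import re

# Illustrative corporate designators only; a real system would use a lexicon.
DESIGNATORS = ["Co.", "Corp.", "Inc.", "Ltd."]

# <proper-noun>+ <corporate designator>, with "proper noun" approximated
# by a capitalized word followed by whitespace.
CORPORATION_RE = re.compile(
    r"\b(?:[A-Z][\w&-]*\s+)+(?:" + "|".join(re.escape(d) for d in DESIGNATORS) + r")"
)

def find_corporations(text):
    """Return every substring that matches the handcrafted corporation pattern."""
    return [m.group(0) for m in CORPORATION_RE.finditer(text)]

easy = "Matsushita Electric Industrial Co. has reached agreement with Sony Corp. today"
hard = "Touch Panel Systems, capitalized at 50 million Yen is owned"
print(find_corporations(easy))
print(find_corporations(hard))
```

The second sentence shows the weakness of the handcrafted rule: a company name without a designator is missed entirely, and a capitalized sentence-initial word such as "The" in front of a name would be absorbed into the match.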
NE tagger - Bikel et al.
PDF file
Viterbi Search
The Viterbi search algorithm is used to find the NC sequence that maximizes the joint probability Pr(NC Sequence, W Sequence).
W1 W2 W3 W4 …..
<SS> PE PE PE PE
LO LO LO LO
OR OR OR OR
Edge scores shown in the figure: <SS>→PE -0.2, <SS>→LO -1.2, <SS>→OR -0.9; PE→PE -0.8, LO→PE -0.3, OR→PE -0.05.
• The best paths reaching the nodes associated with W1 are self-evident.
• Compute the best paths reaching the nodes associated with W2. Three paths reach the node (W2, PE): (PE, PE, -1.0), (LO, PE, -1.5), (OR, PE, -0.95). The best path reaching (W2, PE) is (OR, PE, -0.95).
• Keep only the best path reaching each node and continue the same computation for the next word.
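The search can be sketched as follows. This is a generic Viterbi pass over the trellis, not the tagger's actual code: `viterbi` and `edge_score` are hypothetical names, scores are treated as additive log scores, the assignment of the six figure values to edges is a reading of the slide (it reproduces the path totals -1.0, -1.5 and -0.95), and every edge not shown on the slide is given an arbitrary low score.

```python
def viterbi(words, states, start, edge_score):
    """Return {state: (score, path)} for the best path ending in each state at the last word.

    edge_score(prev_state, state, position) is the log score of entering `state`
    at word index `position`; at position 0 the previous state is `start`.
    """
    best = {s: (edge_score(start, s, 0), [s]) for s in states}
    for i in range(1, len(words)):
        step = {}
        for s in states:
            # Keep only the best path reaching node (words[i], s).
            prev = max(states, key=lambda p: best[p][0] + edge_score(p, s, i))
            step[s] = (best[prev][0] + edge_score(prev, s, i), best[prev][1] + [s])
        best = step
    return best

# Scores read off the slide; keys are (previous state, state, word index).
SLIDE_SCORES = {
    ("<SS>", "PE", 0): -0.2, ("<SS>", "LO", 0): -1.2, ("<SS>", "OR", 0): -0.9,
    ("PE", "PE", 1): -0.8, ("LO", "PE", 1): -0.3, ("OR", "PE", 1): -0.05,
}

def edge_score(prev_state, state, position):
    # Edges not shown on the slide get an arbitrary low score (assumption).
    return SLIDE_SCORES.get((prev_state, state, position), -5.0)

best = viterbi(["W1", "W2"], ["PE", "LO", "OR"], "<SS>", edge_score)
score, path = best["PE"]
print(path, round(score, 2))
```

Because only one best path per node is kept at each word, the work grows linearly with sentence length and quadratically with the number of name classes, instead of exponentially with sentence length.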
What next?
We know how to tag NEs locally. What next?
• Alias resolution: George W. Bush, President Bush, Bush
• Relationship extraction: affiliation, spouse, address
• Event Detection
• Entity Profiles
Extracting relationships and events
Two major approaches
• grammatical
• statistical
Grammatical approaches
• require SVO parsing, semantic parsing as a first step
• followed by specialized relationship and event extraction grammars
• two approaches here also: one behemoth grammar (CFG); cascaded, finite state grammars
Statistical approaches
• supervised learning approach
• unsupervised approach using extraction patterns
Architecture of InfoXtract Engine/Platform
Document Processor
Knowledge Resources
Lexicon Resources
Grammars
Output Manager
Linguistic Modules
Tokenizer
Token List
Lexicon Lookup
Pragmatic Filtering
POS Tagging
Named Entity Detection
Shallow Parsing
Semantic Parsing
Relationship Detection
NE, PE, CE, SVO, CO, Profile, GE
Number Normalization
Alias/Coreference Linking
Time/Location Normalization
Profile/Event Linking
Profile/Event Merge
FST Module
Procedure or Statistical Model
Hybrid Module
Abbreviations:
NE: Named Entity
CE: Correlated Entity
SVO: Subject-Verb-Object
CO: Co-reference
GE: General Event
PE: Pre-defined Event
POS: Part Of Speech
FST: Finite State Transducer
Web Server
Zoned Text Document
XML Formatted Extracted Document
HTTP Post
HTTP Response
Document & Error Log
Process Manager
Source Document
Token List
HTTP
CORBA
Legend: Natural Language Processing, Hybrid Model
Adapting FSTs for NLP engines
• Traditionally, FSTs have operated on character streams, both input and output
    primarily used in lexical transducers
• InfoXtract tokenizer converts input stream into tokenlist: all subsequent modules operate on tokenlist
• tokenlist contains the following information:
    linguistic features (POS, semantic class from WordNet etc.)
    linguistic structures derived from NLP (e.g., SVO)
    information extraction output: NE, relationships, events
    pointers to tokens (text offsets)
    real objects (text strings) as well as virtual objects
• FST grammars operate on tokenlists and can utilize features at several levels
    character/string level, structure level
    equivalent to tree-walking automata
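To make the idea of a token list concrete, here is a small sketch of tokens that carry text offsets plus a feature dictionary, and a matcher whose pattern elements can test either the string or a feature added by an earlier module. This is not InfoXtract's data structure or grammar formalism: `Token`, `tokenize` and `match_pattern` are hypothetical, the tokenizer just splits on whitespace, and the matcher handles only fixed-length sequences rather than a full finite state transducer.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    text: str
    start: int  # character offset of the token in the source text
    end: int
    features: dict = field(default_factory=dict)  # e.g. {"pos": "NNP", "ne": "PE"}

def tokenize(text):
    """Whitespace tokenizer that records text offsets for every token."""
    tokens, pos = [], 0
    for piece in text.split():
        start = text.index(piece, pos)
        tokens.append(Token(piece, start, start + len(piece)))
        pos = start + len(piece)
    return tokens

def match_pattern(tokens, predicates):
    """Return start indices where consecutive tokens satisfy the predicates in order."""
    n = len(predicates)
    return [
        i for i in range(len(tokens) - n + 1)
        if all(pred(tok) for pred, tok in zip(predicates, tokens[i:i + n]))
    ]

text = "spokesman Tom White said"
tokens = tokenize(text)
# Pretend an earlier named entity module tagged "Tom White" as a person.
tokens[1].features["ne"] = "PE"
tokens[2].features["ne"] = "PE"

# One string-level test followed by one feature-level test.
pattern = [
    lambda t: t.text == "spokesman",
    lambda t: t.features.get("ne") == "PE",
]
print(match_pattern(tokens, pattern))
```

Keeping offsets on every token means any later module can map its output back to the exact span in the source document.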