AAAI 2002 WS 1
Peppering knowledge sources with SALT
(Boosting conceptual content for ontology generation)
Deryle Lonsdale, Yihong Ding, David W. Embley, Alan Melby
Brigham Young University
[email protected], {ding,embley}@cs.byu.edu,
AAAI 2002 WS 2
Acknowledgements
Co-authors (Embley, Ding)
EU Fifth Framework IST/HLT 3.4.1
NSF Information and Intelligent Systems grant IIS-0083127
Gerhard Budin (Eurodicautom data)
Sergei Nirenburg (Mikrokosmos ontology)
AAAI 2002 WS 3
Outline
Termbases and lexicons: (re)use(s)
The SALT and TIDIE projects
Data modeling and data resources
Termbase conversion
Ontology generation
Results and evaluation
Conclusions
AAAI 2002 WS 4
Termbases
Terminology databases for humans in the multilingual documentation industry
Several models, formats; often concept-oriented in nature
Termium, Eurodicautom, etc.
AAAI 2002 WS 5
Lexicons
NLP applications: IR, MT, NLU, speech understanding
Widely varying data formats
Description at various levels of linguistic theory
AAAI 2002 WS 6
Sharing resources
Integration is the trend
  Lexicons (OLIF for MT system lexicons)
  Termbases (MARTIF for human termbases)
  Lexicons and termbases
Needed: principled data-modeling approach
  Wide variety of information to be treated
  Wide range of formats currently in use
AAAI 2002 WS 7
The SALT project
SALT: Standards-based Access service to multilingual Lexicons and Terminologies (www.ttt.org/salt)
International cooperation, standards for coding and interchange of linguistic data, and the combining of technologies
Several partners (BYU TRG, KSU, etc.)
Data-modeling approach to address the problem of interchange among diverse collections of such data, including their ontological substructure
AAAI 2002 WS 8
The SALT approach
Goal: provide
  1) Modularity: differentiate core structure vs. data category specification
  2) Coherence: use a meta-model
  3) Flexibility: support interoperable alternative representations
Modular meta-model approach
Implemented in various settings
Ongoing refinement: model’s coverage
AAAI 2002 WS 9
The TIDIE project
TIDIE: Target-based Independent-of-Document Information Extraction (www.deg.byu.edu)
Ontology-based data extraction
Conceptual modeling of real-world applications
Narrow, data-rich domains
Leverage (or build) custom ontologies for target-based extraction
AAAI 2002 WS 10
Information exchange
[Figure: information exchange between a Source and a Target, involving Information Extraction and Schema Matching; callouts read "Leverage this …" and "… to do this".]
AAAI 2002 WS 11
Information Extraction
Examine/retrieve information from documents to fill in a user-supplied template
Requires some user-oriented specification of information
Our approach: finding, extracting, structuring, and synthesizing information is easier given a conceptual-model-based ontology
AAAI 2002 WS 12
Extracting pertinent information from documents
AAAI 2002 WS 13
A Conceptual Modeling Solution
[Figure: conceptual model for car ads. Object sets: Car, Year, Price, Make, Mileage, Model, Feature, PhoneNr, Extension. Relationships are labeled "has" and "is for", with participation constraints such as 0..1, 0..* and 1..*.]
AAAI 2002 WS 14
Car-Ads Ontology
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant {
    extract "\d{2}";
    context "([^\$\d]|^)[4-9]\d,[^\d]";
    substitute "^" -> "19";
  },
…
End;
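The Year data frame above can be read as three steps: find a context match, extract the two digits inside it, and prefix the century. The following is a minimal Python sketch of that reading, not the extraction engine's own code; the function name `extract_year` and the sample strings are illustrative assumptions, and only the three patterns come from the slide.

```python
import re

# Patterns copied from the Year data frame on the slide.
CONTEXT = re.compile(r"([^\$\d]|^)[4-9]\d,[^\d]")
EXTRACT = re.compile(r"\d{2}")


def extract_year(text):
    """Return a four-digit year for the first two-digit year in context, else None.

    Hypothetical helper: the name and return convention are assumptions.
    """
    context = CONTEXT.search(text)
    if context is None:
        return None
    # "extract": pull the two digits out of the matched context.
    two_digits = EXTRACT.search(context.group(0)).group(0)
    # "substitute ^ -> 19": prefix the extracted value with the century.
    return re.sub(r"^", "19", two_digits)


YEAR_FROM_AD = extract_year("Subaru 89, runs well")
YEAR_FROM_PRICE = extract_year("asking $89, obo")
```

The context pattern is what keeps a price such as "$89," from being read as a year: the character before the digits must be neither a dollar sign nor another digit.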
AAAI 2002 WS 15
Recognition and Extraction
Car Year Make Model Mileage Price PhoneNr
0001 1989 Subaru SW $1900 (363)835-8597
0002 1998 Elandra (336)526-5444
0003 1994 HONDA ACCORD EX 100K (336)526-1081

Car Feature
0001 Auto
0001 AC
0002 Black
0002 4 door
0002 tinted windows
0002 Auto
0002 pb
0002 ps
0002 cruise
0002 am/fm
0002 cassette stero
0002 a/c
0003 Auto
0003 jade green
0003 gold
AAAI 2002 WS 16
Lexical resources for data modeling
Information extraction also requires knowledge representations with terminological and conceptual content.
Extraction ontology knowledge sources must:
  be of a general nature
  contain meaningful relationships
  already exist in machine-readable form
  have a straightforward conversion into XML
This paper: create, leverage large-scale termbase
  some ontological structure
  reformatted according to the SALT standard
  converted into μK-compliant XML for use by the ontology generator
AAAI 2002 WS 17
Eurodicautom
Well-known, widely-used termbase
  > 1 million concept entries
  Wide range of topics
  Entries are multilingual
Entry information: sources cited, input/approval dates, …
Single-word terms (e.g. “generator”) or multi-word expressions (e.g. “black humus”)
Entries each have Lenoch subject-area code
  Hierarchical representation for classifying terms (and by extension their related concepts)
AAAI 2002 WS 18
Partial Eurodicautom entry
%%CM AG4 CH6 GO6
%%DA
%%VE lavmosetørv
%%RF A.Klougart
%%EN
%%VE black humus
%%RF CILF,Dict.Agriculture,ACCT,1977
%%IT
%%VE humus nero
%%RF BTB
%%ES
%%VE humus negro
%%RF CILF,Dict.Agriculture,ACCT,1977
%%SV
%%VE sumpjord
%%RF Mats Olsson,SLU(1997)
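A rough Python sketch of how such a native entry might be split into fields follows. It assumes, based only on the sample above, that `%%CM` carries the Lenoch codes, `%%VE` a term, `%%RF` its source, and that any other two-letter tag switches the current language; `parse_entry` and the dictionary layout are hypothetical, and the sample is abridged to three languages.

```python
SAMPLE = (
    "%%CM AG4 CH6 GO6"
    "%%DA%%VE lavmosetørv%%RF A.Klougart"
    "%%EN%%VE black humus%%RF CILF,Dict.Agriculture,ACCT,1977"
    "%%SV%%VE sumpjord%%RF Mats Olsson,SLU(1997)"
)


def parse_entry(raw):
    """Split a native entry into subject codes and per-language terms."""
    entry = {"codes": [], "terms": {}}
    lang = None
    for field in raw.split("%%"):
        field = field.strip()
        if not field:
            continue
        tag, _, value = field.partition(" ")
        if tag == "CM":
            entry["codes"] = value.split()
        elif tag in ("VE", "RF"):
            if lang is None:
                continue  # term or source with no language set: skip it
            key = "term" if tag == "VE" else "source"
            entry["terms"].setdefault(lang, {})[key] = value
        else:
            lang = tag  # assumed: any other tag is a language code
    return entry


ENTRY = parse_entry(SAMPLE)
```

Because fields are split on the `%%` marker rather than on line breaks, the same function works whether the entry arrives on one line or one field per line.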
AAAI 2002 WS 19
Sample Lenoch codes
AD Public Administration - Private Administration - Offices
AD1 general aspects of the subject field
AD2 public and private organisations
AD3 publications & documentary search
AD31 documentation and information systems
AD4 administrative staff
AD5 public procurement
AD51 expropriation in the public interest

TEH testing methods
TEH1 general aspects of testing methods
TEH2 non-destructive testing
TEH21 chemical tests
TEH22 photometrical testing
TEH221 X-ray spectrometrical testing
AAAI 2002 WS 20
Converting the termbase
Use several thousand English terms and their subject codes
%%CM line lists three Lenoch codes:
  AG4 (representing the subclass AGRONOMY)
  CH6 (representing ANALYTICAL-CHEMISTRY)
  GO6 (representing GEOMORPHOLOGY)
Convert termbase entries via the SALT-developed TBX termbase exchange framework
  XML-based refinement of MARTIF
Convert to μK XML format used by ontology engine
Result: TBX-mediated conversion from native Eurodicautom terms to the final XML-specified ontology (μK)
Lenoch codes re-interpreted as typical hierarchical relations (e.g. IS-A and SUBCLASS)
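One way to picture the re-interpretation of Lenoch codes as hierarchical relations is sketched below in Python. The rule that a code's parent is obtained by dropping its final digit is an assumption inferred from the sample codes (AD31 under AD3, TEH221 under TEH22), not a documented property of the Lenoch scheme; `parent_code` and `hierarchy_records` are hypothetical names.

```python
import re

# Labels copied from the sample Lenoch codes.
LENOCH = {
    "TEH": "testing methods",
    "TEH2": "non-destructive testing",
    "TEH22": "photometrical testing",
    "TEH221": "X-ray spectrometrical testing",
}


def parent_code(code):
    """Assumed rule: dropping the final digit yields the parent code."""
    return code[:-1] if re.fullmatch(r"[A-Z]+\d+", code) else None


def hierarchy_records(codes):
    """Emit (concept, slot, filler) triples for each parent/child pair."""
    records = []
    for code, label in codes.items():
        parent = parent_code(code)
        if parent in codes:
            records.append((label, "IS-A", codes[parent]))
            records.append((codes[parent], "SUBCLASSES", label))
    return records


RECORDS = hierarchy_records(LENOCH)
```

Each parent/child pair yields two triples, one per direction, mirroring the IS-A and SUBCLASSES slots used in the derived ontology.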
AAAI 2002 WS 21
Conversion process
[Figure: conversion pipeline Eurodicautom (native) → Eurodicautom (TBX) → Eurodicautom (μK); the diagram also carries the labels "Lenoch" and "SALT".]
AAAI 2002 WS 22
Eurodicautom-TBX encoding
<?xml version="1.0" encoding="UTF-8"?>
<martif xmlns="x-schema:XLTcsV04.xml" lang="en" type="DXLT">
  <martifHeader>
    <fileDesc>
      <sourceDesc>
        <p>sample Eurodicautom entry</p>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <p type="DCSName">DXLTdv04.xml</p>
    </encodingDesc>
  </martifHeader>
  <text>
    <body>
      <termEntry id="eid-EDIC-BTB-DAG77-63">
        <admin type="originatingInstitution">BTB</admin>
        <admin type="projectSubset">DAG77</admin>
        <descrip type="reliabilityCode">4</descrip>
        <langSet lang="pt">
          <ntig>
            <termGrp>
              <term>souto</term>
              <termNote type="termType">fullForm</termNote>
            </termGrp>
            <admin type="conceptID">BTB-DAG77-63</admin>
            <admin target="bib-Mock" type="sourceIdentifier">V.Correia,Engº Agrónomo,PDR Vale do Lima</admin>
          </ntig>
          <ntig>
            <termGrp>
              <term>minifúndio</term>
              <termNote type="termType">fullForm</termNote>
            </termGrp>
            <admin type="conceptID">BTB-DAG77-63</admin>
            <admin target="bib-Mock" type="sourceIdentifier">V.Correia,Engº Agrónomo,PDR Vale do Lima</admin>
          </ntig>
        </langSet>
      </termEntry>
AAAI 2002 WS 23
Derived XML ontology
<RECORD>
  <CONCEPT>xenobiotic substances</CONCEPT>
  <SLOT>SUBCLASSES</SLOT>
  <FACET>VALUE</FACET>
  <FILLER>hazardous raw materials</FILLER>
  <UID>0</UID>
</RECORD>
<RECORD>
  <CONCEPT>physical nuisances</CONCEPT>
  <SLOT>SUBCLASSES</SLOT>
  <FACET>VALUE</FACET>
  <FILLER>ambient light</FILLER>
  <UID>0</UID>
</RECORD>
<RECORD>
  <CONCEPT>financial statistics</CONCEPT>
  <SLOT>IS-A</SLOT>
  <FACET>VALUE</FACET>
  <FILLER>economic statistics</FILLER>
  <UID>0</UID>
</RECORD>
…
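Records of this shape could be produced with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The element names come from the slide; the helper `make_record`, its default `uid`, and the fixed `VALUE` facet are assumptions made for illustration rather than the converter actually used.

```python
import xml.etree.ElementTree as ET


def make_record(concept, slot, filler, uid="0"):
    """Build one <RECORD> element with the five children shown on the slide."""
    record = ET.Element("RECORD")
    fields = (
        ("CONCEPT", concept),
        ("SLOT", slot),
        ("FACET", "VALUE"),  # assumed constant, as in every sample record
        ("FILLER", filler),
        ("UID", uid),
    )
    for tag, text in fields:
        ET.SubElement(record, tag).text = text
    return record


RECORD_XML = ET.tostring(
    make_record("financial statistics", "IS-A", "economic statistics"),
    encoding="unicode",
)
```

Building the elements through a library rather than string concatenation means every opening tag gets its matching close and special characters in terms are escaped.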
AAAI 2002 WS 24
Ontology generation
Goal: specify an ontology for information extraction purposes
Problem: complex, tedious, costly
Ideally: automatically generate schemas, ontologies
Source: natural-language text, tables, etc.
AAAI 2002 WS 25
Ontology generation overview
AAAI 2002 WS 26
Knowledge sources
Mikrokosmos (μK) ontology
  About 5,000 hierarchically-arranged concepts
  Fairly high connectivity (about 14 inter-concept links per node)
  Fairly general content, inheritance of properties
Data frame library
  regular-expression templates for matching structured low-level lexical items (e.g. measurements, dates, currency expressions, and phone numbers)
  provide information for conceptual matching via inheritance
Lexicons (e.g. onomastica, WordNet synsets)
Domain-specific training documents
AAAI 2002 WS 27
Knowledge integration
AAAI 2002 WS 28
Methodology
Preprocess input knowledge sources:
  Integrate: map lexicon content and data frame templates to nodes in the merged ontology
  Extract: match information from training documents collection
  Parse, tokenize, regularize lexical content
Generate the ontology: four-stage generation process
  concept selection
  relationship retrieval
  constraint discovery
  refinement of the output ontology
AAAI 2002 WS 29
Processing input documents
AAAI 2002 WS 30
Concept selection
Finding which subset of the ontology’s concepts is of interest to a user
Concepts are selected via string matches between textual content and the ontological data.
Three different selection heuristics
  concept-name matching
  concept-value matching
  data-frame pattern matching
String matching plus:
  word synonym matching: WordNet synonym sets
  multi-word term matching: bag-of-words (CAPITAL-CITY is considered a synonym of capital and city)
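Concept-name matching with the bag-of-words treatment of multi-word names might look like the Python sketch below. It is a simplification under stated assumptions: the synonym table is a plain dictionary standing in for WordNet synonym sets, and `name_words` and `select_concepts` are hypothetical names, not the system's own functions.

```python
import re


def name_words(concept):
    """CAPITAL-CITY -> {"capital", "city"}: bag-of-words view of a concept name."""
    return {word.lower() for word in concept.split("-")}


def select_concepts(text, concepts, synonyms=None):
    """Return the concepts whose name words (or their synonyms) occur in text."""
    synonyms = synonyms or {}  # stand-in for WordNet synonym sets
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    selected = []
    for concept in concepts:
        triggers = set(name_words(concept))
        for word in name_words(concept):
            triggers |= set(synonyms.get(word, ()))
        if tokens & triggers:
            selected.append(concept)
    return selected


SELECTED = select_concepts(
    "Capital--Kabul, Other cities--Kandahar",
    ["CAPITAL-CITY", "FINANCIAL-CAPITAL", "POPULATION"],
)
```

Matching on any single word of a name is deliberately generous: the word "Capital" selects both CAPITAL-CITY and FINANCIAL-CAPITAL, which is exactly the kind of ambiguity a later conflict-resolution step has to settle.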
AAAI 2002 WS 31
Concept selection algorithm
PROCEDURE ConceptSelection(Tdoc, Kbase)
  SourceDoc = Parse(Tdoc);
  PrimarySelectedConceptsList = MikroSelection(M-Ontology);
  SecondarySelectedConceptsList = DataFrameSelection(DF-Library);
  ConflictHandling();
  SelectedSubgraphGeneration();
AAAI 2002 WS 32
Basic Selection Strategy
Select from Mikrokosmos Ontology

Afghanistan smaller than Texas.
Area: 648,000 sq. km.
Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
AAAI 2002 WS 33
Basic Selection Strategy
Select from Mikrokosmos Ontology
  concept names and their synonyms

Afghanistan smaller than Texas.
Area<GeographicalArea>: 648,000 sq. km.
Capital<CapitalCity><FinancialCapital>--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population<Population>: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
AAAI 2002 WS 34
Basic Selection Strategy
Select from Mikrokosmos Ontology
  concept names and their synonyms
  concept values and their synonyms

Afghanistan<Nation> smaller than Texas<USState>.
Area<GeographicalArea>: 648,000 sq. km.
Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population<Population>: 17.7 million.
Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
AAAI 2002 WS 35
Basic Selection Strategy
Select from Mikrokosmos Ontology
  concept names and their synonyms
  concept values and their synonyms
Select from Data Frame Libraries

Afghanistan smaller than Texas.
Area: 648,000 sq. km.
Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
AAAI 2002 WS 36
Basic Selection Strategy
Select from Mikrokosmos Ontology
  concept names and their synonyms
  concept values and their synonyms
Select from Data Frame Libraries
  extract result based on the data frames

Afghanistan smaller than Texas.
Area: 648,000<Area><Mileage> sq. km.
Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7<Time> million<Population><Price>.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
AAAI 2002 WS 37
Concept conflict resolution
Arrive at an internally consistent set of selected concepts.
Two levels of resolution
  Document-level resolution
  Knowledge-source resolution
Criteria: lexical occurrence, proximity, length and distribution of words and terms
Preferences from among knowledge sources specifying matches
Other default strategies
AAAI 2002 WS 38
Document-Level Conflict
Afghanistan smaller than Texas.
Area: 648,000<Area><Mileage> sq. km.
Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7<Time> million<Population><Price>.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
AAAI 2002 WS 39
Concept-Level Conflict
Afghanistan smaller than Texas.
Area<GeographicalArea>: 648,000<Area> sq. km.
Capital--Kabul, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population<Population>: 17.7 million<Population>.
Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.
AAAI 2002 WS 40
Relationship retrieval
Ontology structure: directed graph, nodes are concepts
Conceptual relationship: all paths connecting concepts generated at given stage
Theoretical solution: find all the paths in the graph (NP-complete)
When multiple paths do exist, take the shortest path between 2 concepts (cf. μK Onto-Search algorithm)
Dijkstra’s (polynomial) algorithm to compute the most salient relationships between concepts
Distance threshold on path length to prune weak relationships
Construct schemas, or linked conceptual configurations, from the relationships posited in the previous step.
  Primary concept selected (or posited): highest connectivity
  Cardinalities inferred from observed relationships
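The shortest-path step with a distance threshold can be sketched in Python as follows. This is textbook Dijkstra with pruning, offered as an illustration rather than the μK Onto-Search implementation; the toy graph, its unit edge weights, and the names `shortest_paths` and `GRAPH` are assumptions.

```python
import heapq

# Hypothetical fragment of a concept graph; every link has weight 1.
GRAPH = {
    "CAPITAL-CITY": {"CITY": 1, "PLACE": 1},
    "CITY": {"NATION": 1},
    "PLACE": {"REGION": 1},
    "REGION": {"NATION": 1, "CONTINENT": 1},
}


def shortest_paths(graph, source, max_distance):
    """Dijkstra's algorithm; paths longer than max_distance are pruned."""
    best = {source: (0, [source])}
    queue = [(0, source, [source])]
    while queue:
        dist, node, path = heapq.heappop(queue)
        if dist > best[node][0]:
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            new_dist = dist + weight
            if new_dist > max_distance:
                continue  # threshold: treat distant concepts as unrelated
            if neighbour not in best or new_dist < best[neighbour][0]:
                best[neighbour] = (new_dist, path + [neighbour])
                heapq.heappush(queue, (new_dist, neighbour, path + [neighbour]))
    return best


PATHS = shortest_paths(GRAPH, "CAPITAL-CITY", max_distance=2)
```

In this toy graph NATION is reachable in two steps through CITY and in three through PLACE and REGION; only the shorter route is kept, and CONTINENT falls outside the threshold altogether.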
AAAI 2002 WS 41
Participation Constraints
Afghanistan<Nation> smaller than Texas.
Area: 648,000 sq. km.
Capital--Kabul<CapitalCity>, Other cities--Kandahar Mazar-e-Sharif Konduz
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1]
AAAI 2002 WS 42
Participation Constraints (2)
Afghanistan<Nation> smaller than Texas.
Area: 648,000 sq. km.
Capital--Kabul<City>, Other cities<City>--Kandahar<City> Mazar-e-Sharif<City> Konduz<City>
Terrain: Landlocked; mostly mountains and desert.
Climate: Dry, with cold winters and hot summers.
Population: 17.7 million.
Agriculture: Wheat, corn, barley, rice, cotton, fruit, nuts, karakul pelts, wool, mutton.

City [1:1] PartOf Nation [1:*]
AAAI 2002 WS 43
Refining results
Output ontology: may require hand-crafting
  can be done in a text editor (flat ASCII ontology)
Considerable expertise required:
  markup syntax
  specification of conceptual relations
  familiarity with regular-expression writing
Possible solution: ontology editors for typical end-users
With rich enough knowledge sources and a good set of training documents, however, we believe that the generation of extraction ontologies can be fully automatic.
AAAI 2002 WS 44
Testing the system
Input: various U.S. Department of Energy abstracts
Knowledge base:
  μK ontology
  Energy sub-hierarchy of Eurodicautom terms (300)
AAAI 2002 WS 45
Sample application document
The trend in supply and demand of fuel and the fuels for electric power generation, iron manufacturing and transportation were reviewed from the literature published in Japan and abroad in 1986. FY 1986 was a turning point in the supply and demand of energy and also a serious year for them because the world crude oil price dropped drastically and the exchange rate of yen rose rapidly since the end of 1985 in Japan as well. The fuel consumption for steam power generation in FY 1986 shows the negative growth for two successive years as much as 98.1%, or 65,730,000 kl in heavy oil equivalent, to that in the previous year. The total energy consumption in the iron and steel industry in 1986 was 586 trillion kcal (626 trillion kcal in the previous year). The total sales amount of fuel in 1986 was 184,040,000 kl showing a 1.5% increase from that in the previous year. The concept Best Mix was proposed as the ideal way in the energy industry. (21 figs, 2 tabs, 29 refs)
AAAI 2002 WS 46
Sample output
-- energy2 Information Ontology
energy2 [-> object];
energy2 [0:*] has Alloy [1:*];
energy2 [0:*] has Consumption [1:*];
energy2 [0:*] has CrudeOil [1:*];
energy2 [0:*] has ForProfitCorporation [1:*];
energy2 [0:*] has FossilRawMaterials [1:*];
energy2 [0:*] has Gas [1:*];
energy2 [0:*] has Increase [1:*];
energy2 [0:*] has LinseedOil [1:*];
energy2 [0:*] has MetallicSolidElement [1:*];
energy2 [0:*] has Ores [1:*];
energy2 [0:*] has Produce [1:*];
energy2 [0:*] has RawMaterials [1:*];
energy2 [0:*] has RawMaterialsSupply [1:*];
Alloy [0:*] MadeOf.SOLIDELEMENT.Subclasses MetallicSolidElement [0:*];
Alloy [0:*] IsA.METAL.StateOfMatter.SOLID.Subclasses CrudeOil [0:*];
Alloy [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Produce [0:*];
AmountAttribute [0:*] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT
Consumption [0:*] IsA.FINANCIALEVENT.Agent Human [0:*];
ControlEvent [0:*] IsA.SOCIALEVENT.Agent Human [0:*];
ControlEvent [0:*] IsA.SOCIALEVENT.Location.PLACE.Subclasses Nation [0:*];
CountryName [0:*] NameOf Nation [0:*];
CountryName [0:*] IsA.REPRESENTATIONALOBJECT.OwnedBy Human [0:*];
CrudeOil [0:*] IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation [0:*];
CrudeOil [0:*] IsA.PHYSICALOBJECT.OwnedBy Human [0:*];
CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.GROW.Subclasses GrowAnimate [0:*];
CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Increase [0:*];
CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Combine [0:*];
CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Display [0:*];
CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Produce [0:*];
Custom [0:*] IsA.ABSTRACTOBJECT.ThemeOf.MENTALEVENT.Subclasses AddUp [0:*];
Display [0:*] IsA.PHYSICALEVENT.Theme.PHYSICALOBJECT.Subclasses Gas [0:*];
Display [0:*] IsA.PHYSICALEVENT.Theme.PHYSICALOBJECT.OwnedBy Human [0:*];
ForProfitCorporation [0:*] OwnedBy Human [0:*];
ForProfitCorporation [0:*] IsA.CORPORATION.HasNationality Nation [0:*];
Gas [0:*] IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation [0:*];
Gas [0:*] IsA.PHYSICALOBJECT.ThemeOf.GROW.Subclasses GrowAnimate [0:*];
LinseedOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Increase [0:*];
AAAI 2002 WS 47
Evaluation
Several dozen relationships are generated
  Correct: relationship is posited between the concept CRUDE-OIL and the action PRODUCE; the role is Theme, meaning that one can PRODUCE CRUDE-OIL
  Incorrect: relationship between GAS and GROW
Precision: relatively low (around 75%) due to high number of matches
Recall: better (around 90%)
Note: it’s easier for a human to refine the system’s output by rejecting spurious relationships (i.e. deleting false positives) than to specify relationships that the system has missed.
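For reference, precision and recall over generated relationships can be computed as in the Python sketch below. The two generated triples echo the correct and incorrect examples named on the slide; the gold-standard set, including its placeholder third triple, is purely hypothetical, so the resulting numbers illustrate the arithmetic and are not the reported results.

```python
def precision_recall(generated, relevant):
    """Precision = correct / generated; recall = correct / relevant."""
    generated, relevant = set(generated), set(relevant)
    correct = generated & relevant
    return len(correct) / len(generated), len(correct) / len(relevant)


# Triples the system proposed (from the slide's two examples).
GENERATED = [
    ("CRUDE-OIL", "ThemeOf", "PRODUCE"),  # judged correct on the slide
    ("GAS", "ThemeOf", "GROW"),           # judged incorrect on the slide
]

# Hypothetical gold standard: one hit plus one relationship the system missed.
RELEVANT = [
    ("CRUDE-OIL", "ThemeOf", "PRODUCE"),
    ("HYPOTHETICAL-A", "RelatedTo", "HYPOTHETICAL-B"),
]

PRECISION, RECALL = precision_recall(GENERATED, RELEVANT)
```

A false positive such as GAS/GROW lowers precision and can be fixed by deletion, whereas a missed relationship lowers recall and has to be written by hand, which is why the slide prefers errors of the first kind.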
AAAI 2002 WS 48
How to improve results
Less general, more focused ontologies
Richer ontological structure
  More types of hierarchical relationships (beyond IS-A and its inverse, SUB-CLASSES)
  Deeper hierarchies (maximum 4 in Lenoch)
Note: TBX supports several data types for conceptual encoding
AAAI 2002 WS 49
Related work
Lexical chaining in NLP
  extracting and associating chains of word-based relationships from text
  relating words and terms to resources like WordNet
  Widely used in text categorization, automatic summarization, and topic detection and tracking
Our contributions:
  integrating disparate knowledge sources for similar tasks
  Discovering and generating a compatible set of ontological relationships
AAAI 2002 WS 50
Conclusions
The knowledge acquisition bottleneck impacts ontology construction for information extraction.
Terminographers and lexicographers codify information that can be advantageous for work in semantic-based processing.
Integrating these two disparate areas, it is possible to leverage large-scale terminological and conceptual information with relationship-rich semantic resources in order to reformulate, match, and merge retrieved information of interest to a user.
Possible future applications:
  Knowledge-focused personal agents
  Customized search, filtering, and extraction tools
  Individually tailored views of data via integration, organization, and summarization
Lots of work still to be done…