Bill Gates is not a Parking Meter: Philosophical Quality Control in Automated Ontology-building
Catherine Legg & Sam Sarjant
University of Waikato
2 July, 2012
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
What is wrong with this statement?
S1: The number 8 is a very red number.
This seems somehow worse than false.
For a false statement can be negated to produce a truth, but:
S2: The number 8 is not a very red number.
doesn’t seem right either.
The problem seems to be that numbers are not the kind of thing that can have colours. If someone thinks so then they don’t understand what kinds of things numbers are.
Traditional philosophical terminology for what is wrong is that S1 commits a category mistake.
(Fregean quibbles, and the odd synaesthesia episode, notwithstanding.)
It is well-known that the philosophical discipline of ontology was invented by Aristotle to deal with these kinds of problems (among others):
ὄντος (being) + λόγος (theory of)
It was called by him first philosophy (i.e. the fundamental science). A key part of its role was to define categories.
Categories are intended to describe the most fundamentally different kinds of things that exist in reality. Suggested examples include:
• Physical objects
• Times
• Events
• Relationships
• Numbers
Traditionally ontologies were built into a taxonomic structure, or ‘tree of knowledge’:
‘Tree of Porphyry’ (3rd century A.D.)
Category vs. Property Distinctions
• There is a subtle but important distinction between philosophical categories and mere properties.
• Although both divide entities into groups, and may be represented by classes, categories provide a deeper, more sortal division which enforces constraints.
• E.g. while we know the same thing cannot be a colour and a number, the same cannot be said for green and square.
• However, at what ‘level’ of an ontology constraining categorical divisions give way to non-constraining property divisions is frequently unclear and contested.
• This has led to skepticism about philosophical categories.
• Plus, the disdain for “speculative metaphysics” shown by C20th logical positivism (and various successors) didn’t help.
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
Why Automated Ontology-Building?
• A wealth of new free user-supplied Web content (‘Web 2.0’) for raw data (blogs, tags, Wikipedia…).
• New automated methods of determining semantic relatedness (e.g. Milne & Witten, “An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links”, 2008).
• Manual methods widely agreed to be too slow and labour-intensive (e.g. the Cyc project: 25 years, still unfinished).
What Has Been Done So Far?
• YAGO (WordNet backbone, enriched with Wikipedia’s leaf categories. Taxonomic.)
• DBpedia (Wikipedia infoboxes and other semi-structured data. Not taxonomic.)
• EMLR (Wikipedia category network, enriched with further relations from, e.g., parsing category names such as BornIn1954. Taxonomic.)
Our Choice
• Use the Cyc taxonomy as backbone, because:
   • It is highly principled, the result of so many years’ labour.
   • It has a purpose-built inference engine to reason over the knowledge.
   • It is now publicly available ‘open source’ (ResearchCyc).
   • One researcher (Legg) had inside knowledge of the system, which is rare worldwide.
• Build onto it knowledge mined from Wikipedia, because:
   • We have access to a Wikipedia-based automated measure of semantic relatedness developed at the University of Waikato.
   • Wikipedia is astounding!
      • 2.4M (English) articles, referred to by 3M different terms
      • ~25 hyperlinks per article
      • 175K templates for semi-structured data entry (including 9K ‘infoboxes’)
      • full editing history for every article… etc., etc.
Cyc Ontology & Knowledge Base
OpenCyc contains:
• ~13,500 predicates
• ~200,000 concepts
• ~3,000,000 assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic (Microtheories)
[Diagram: the Cyc knowledge base pyramid. At the top, the upper ontology: Thing, branching into IntangibleThing, Individual, TemporalThing, SpatialThing and PartiallyTangible. Below, core theories covering Time, Space, Events, Agents, Logic/Math, Sets/Relations, Paths, Artifacts, Living Things, Human Beings, Human Activities, Social Behavior, Organizations, Nations/Governments/Geo-Politics, Natural and Political Geography, Weather, and much more. At the base: domain-specific knowledge (e.g., Bio-Warfare, Terrorism, Computer Security, Military Tactics, Command & Control, Health Care, …) and domain-specific facts and data.]
Cyc Common-Sense Knowledge
1. Semantic Argument Constraints on Relations
Cyc contains many assertions of the form:
(arg1Isa birthDate Animal)
(arg2Isa capital City)
These represent that only animals have birthdates, and that the capital of a country must be a city.
These features of Cyc are a form of categorical knowledge. Although some of the categories invoked might seem relatively specific and trivial compared to Aristotle’s, logically the constraining process is the same.
Cyc enforces these constraints at knowledge entry, a notable difference from every other formal ontology of its size.
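A minimal sketch of how such argument constraints might be enforced at knowledge entry. The toy taxonomy and helper names below are hypothetical, not Cyc's actual API; for simplicity the arguments are treated directly as collection names.

```python
# Hypothetical mini-taxonomy: each collection maps to its generalisation.
GENLS = {
    "Dog": "Animal", "Animal": "LivingThing",
    "City": "GeographicalRegion", "Number": "MathematicalObject",
}

# Argument constraints, as in (arg1Isa birthDate Animal), (arg2Isa capital City).
ARG_ISA = {
    "birthDate": {1: "Animal"},
    "capital": {2: "City"},
}

def isa_transitive(concept, collection):
    """True if `concept` falls under `collection` via the generalisation chain."""
    while concept is not None:
        if concept == collection:
            return True
        concept = GENLS.get(concept)
    return False

def admissible(predicate, args):
    """Reject an assertion whose arguments violate the argIsa constraints."""
    for pos, required in ARG_ISA.get(predicate, {}).items():
        if not isa_transitive(args[pos - 1], required):
            return False
    return True

print(admissible("birthDate", ["Dog", "1954"]))     # True: a dog is an Animal
print(admissible("birthDate", ["Number", "1954"]))  # False: numbers have no birthdates
```

Checking at entry time, rather than after the fact, keeps contradictions out of the knowledge base entirely.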
Cyc Common-Sense Knowledge
2. Disjointness Assertions Between Collections
ResearchCyc currently contains 6,000 explicitly asserted disjointWith claims, e.g.:
(disjointWith Doorway WindowPortal)
(disjointWith HomogeneousStructure Ecosystem)
(disjointWith YardWork ShootingAProjectileWeapon) (!)
From these, countless further claims can be deduced.
Again, Cyc enforces these constraints at knowledge entry.
WordNet knows about sibling relationships, so in some sense it knows that a cat cannot be a dog. But it cannot ramify this knowledge through its hierarchy in this way.
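The ramification can be sketched as follows: two collections are disjoint whenever any pair of their generalisations was explicitly asserted disjoint. Doorway and WindowPortal come from the slide above; Portal, FrenchDoor and Skylight are hypothetical additions for illustration.

```python
# Hypothetical mini-taxonomy; each collection lists its direct generalisations.
GENLS = {
    "Doorway": ["Portal"], "WindowPortal": ["Portal"],
    "FrenchDoor": ["Doorway"], "Skylight": ["WindowPortal"],
}

# Explicitly asserted disjointness, stored as unordered pairs.
ASSERTED_DISJOINT = {frozenset(("Doorway", "WindowPortal"))}

def ancestors(c):
    """All generalisations of c, including c itself."""
    result, stack = set(), [c]
    while stack:
        node = stack.pop()
        if node not in result:
            result.add(node)
            stack.extend(GENLS.get(node, []))
    return result

def disjoint(a, b):
    """a and b are disjoint if any pair of their ancestors was asserted disjoint."""
    return any(frozenset((x, y)) in ASSERTED_DISJOINT
               for x in ancestors(a) for y in ancestors(b))

# One asserted claim yields many derived ones:
print(disjoint("FrenchDoor", "Skylight"))  # True, though never explicitly asserted
```

This is how 6,000 asserted claims ramify into countless deducible ones.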
[Screenshot: Cyc reasoning over its common-sense knowledge, deriving a conclusion never explicitly asserted into Cyc from a key asserted claim.]
Wikipedia as an ontology
• articles → basic concepts
• infoboxes → facts about those concepts
• first sentences → concept definitions, often in standard format
• hyperlinks between articles → ‘semantic relatedness’ between concepts
• categories → organise articles into conceptual groupings (though these groupings are far from a principled taxonomy enabling knowledge inheritance)
For example, consider the following category…
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
Exact mappings
Stages A & B: easy 1-1 matches using title strings (or synonyms). E.g. Cyc’s CityOfSaarbrucken maps to the Wikipedia article Saarbrücken:
• get synonyms asserted in Cyc, e.g. using #$nameString (‘Saarbrucken’);
• match via Wikipedia redirects (Saarbrucken → Saarbrücken).
Stage C: a number of possible candidates on the Wikipedia side; semantic disambiguation required.
• Collect all candidates (e.g. Kiwi: bird? fruit? nationality?)
• Compute the commonness of each candidate (how often the string is anchor text to its Wikipedia page)
• Collect context from Cyc (concepts nearby in the taxonomy)
• Compute similarity to that context (using Wikipedia hyperlinks: Milne and Witten, 2008)
• Determine the best candidate
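The disambiguation step can be sketched as weighting each candidate's commonness by its relatedness to the Cyc-derived context. Here a crude Jaccard overlap over link sets stands in for the Milne & Witten link-based measure, and all numbers and link sets are invented for illustration.

```python
# Toy candidate data for the string "Kiwi": commonness = fraction of links
# with that anchor text pointing at each article (illustrative numbers only).
candidates = {
    "Kiwi (bird)":   {"commonness": 0.55, "links": {"Bird", "New Zealand", "Flightless bird"}},
    "Kiwifruit":     {"commonness": 0.30, "links": {"Fruit", "Vine", "New Zealand"}},
    "Kiwi (people)": {"commonness": 0.15, "links": {"New Zealand", "Demonym"}},
}

# Context gathered from Cyc: articles mapped from concepts nearby in the taxonomy.
cyc_context = {"Bird", "Flightless bird", "Ratite"}

def relatedness(links, context):
    """Crude stand-in for link-based semantic relatedness: Jaccard overlap."""
    return len(links & context) / len(links | context)

def best_candidate(candidates, context):
    """Weigh commonness against contextual similarity; pick the top scorer."""
    return max(candidates,
               key=lambda a: candidates[a]["commonness"]
                             * relatedness(candidates[a]["links"], context))

print(best_candidate(candidates, cyc_context))  # -> Kiwi (bird)
```

With a bird-heavy Cyc context the bird sense wins, even though the fruit or nationality sense might win under a different context.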
Stage D: many Cyc terms for one Wikipedia article; reverse disambiguation required.
In this stage many candidate mappings were eliminated by mapping back from the Wikipedia article to the Cyc term, discarding mappings which don’t ‘map back’.
For example, Cyc term #$DirectorOfOrganisation incorrectly maps to Film director, but when we attempt to find a Cyc term from Film director we get #$Director-Film.
This reduced the number of mappings by 43%, but increased precision considerably.
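The 'map back' filter can be sketched as a round trip: keep only Cyc → Wikipedia mappings whose article maps back to the same Cyc term. The dictionaries below stand in for the two mapping procedures; only the #$DirectorOfOrganisation and Saarbrücken examples come from the slides.

```python
# Hypothetical outputs of the forward (Cyc -> Wikipedia) and reverse mappings.
forward = {
    "#$DirectorOfOrganisation": "Film director",   # spurious forward mapping
    "#$Director-Film": "Film director",
    "#$CityOfSaarbrucken": "Saarbrücken",
}
reverse = {
    "Film director": "#$Director-Film",
    "Saarbrücken": "#$CityOfSaarbrucken",
}

def bidirectional(forward, reverse):
    """Keep only mappings that survive the round trip Cyc -> Wikipedia -> Cyc."""
    return {cyc: art for cyc, art in forward.items()
            if reverse.get(art) == cyc}

kept = bidirectional(forward, reverse)
print(kept)
# {'#$Director-Film': 'Film director', '#$CityOfSaarbrucken': 'Saarbrücken'}
```

The spurious #$DirectorOfOrganisation mapping is discarded because Film director maps back to #$Director-Film instead.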
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
Finding new concepts in Wikipedia and adding them to Cyc
First we found mapped concepts where the Wikipedia article had an equivalent category (about 20% of mapped concepts). E.g. the article Israeli Settlement has an equivalent category Israeli Settlements.
We then mined this category for new concepts belonging under the mapped Cyc concept, according to the Cyc taxonomy. For instance:
We called these ‘true children’.
We identified true children by:
1. Parsing the first sentences of Wikipedia articles (with a list of regular expressions):
Havat Gilad (Hebrew: חוות גלעד, lit. Gilad Farm) is an Israeli settlement outpost in the West Bank.
Netiv HaGdud (Hebrew: נתיב הגדוד, lit. Path of the Battalion) is a moshav and Israeli settlement in the West Bank.
Kfar Eldad (Hebrew: כפר אלדד) is an Israeli settlement and a Communal settlement in the Gush Etzion Regional Council, south of Jerusalem.
The Yad La’achim operation (Hebrew: מבצע יד לאחים, “Giving hand to brothers”) was an operation that the IDF performed during the disengagement plan.
2. “Infobox pairing” – If an article in the category shares an infobox template with 90% of true children, we include it.
(The Yad La’achim example above fails the first-sentence test.)
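The first-sentence test above can be sketched as follows. The actual system used a list of regular expressions; this single toy pattern merely illustrates the idea of checking whether the copula clause names the target class.

```python
import re

def is_true_child(first_sentence, target="Israeli settlement"):
    """Toy first-sentence test: does the 'is a/an ...' clause name the target class?"""
    match = re.search(r"\bis\b(.*)", first_sentence)
    return bool(match) and target.lower() in match.group(1).lower()

print(is_true_child(
    "Havat Gilad (lit. Gilad Farm) is an Israeli settlement outpost in the West Bank."))
# True
print(is_true_child(
    "The Yad La'achim operation was an operation that the IDF performed."))
# False: "was an operation", not "is a ... Israeli settlement"
```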
Final Results

Method                                        Cyc terms    Percent mapped
Total terms available                         163,000
Common-sense terms                            83,900
Exact (1-1) mappings                          33,500       40%
Further mappings after disambiguation
(2 ways)                                      8,800        10%
Total mapped                                  42,300       50%

Method                                        Cyc assertions    Percent growth
New Cyc concepts                              35,000            30%
New “doc. strings”                            17,000
Other new assertions                          228,000           10%
Evaluation (22 human volunteers, online form, compared with the DBpedia Ontology)

CASE:              1     2     3     4     5     6
DBpedia children   0.58  0.81  0.99  0.98  0.99  0.99
Our new children   0.57  0.88  0.99  0.90  0.90  1.00
2008 mappings      0.65  0.83  0.99  0.99  0.99  1.00
2009 mappings      0.68  0.91  1.00  1.00  1.00  1.00

Cases:
1: 100% of evaluators thought the assignment correct
2: >50% thought the assignment correct
3: At least 1 thought the assignment correct
4: 100% thought the assignment correct or close
5: >50% thought the assignment correct or close
6: At least 1 thought the assignment correct or close
Agenda
1. Philosophical Categories: What are they? What are they good for?
2. Automated Ontology-Building
   • Concept Mapping
   • Adding New Assertions
3. Philosophical Quality Control: Semantic vs. Ontological Relatedness
Quality control is provided via Cyc’s common-sense knowledge: Cyc now knows enough to ‘regurgitate’ many assertions fed to it which are ontologically incorrect.

Examples of rejected assertions:

(#$isa #$CallumRoberts #$Research)
Professor Callum Roberts is a marine conservation biologist, oceanographer, author and research scholar in the Environment Department of the University of York in England.
Why it happened: Cyc knows that the collection of biological living objects is disjoint with the collection of information objects.

(#$isa #$Insight-EMailClient #$EMailMessage)
Insight WebClient is a groupware e-mail client from Bynari, embedded in the Arachne Web Browser for DOS.
Why it happened: Cyc knows that the collection of software programs is disjoint with the collection of letters.
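The rejection mechanism can be sketched as: given what Cyc already knows a term to be, refuse any new isa assertion whose collection is disjoint with that knowledge. The collection names below are illustrative, not Cyc's actual vocabulary.

```python
# Hypothetical fragment of the taxonomy and disjointness network.
KNOWN_ISA = {
    "#$CallumRoberts": "#$BiologicalLivingObject",  # a person
    "#$Insight-EMailClient": "#$ComputerProgram",   # a program
}
GENLS = {
    "#$Research": "#$InformationObject",
    "#$EMailMessage": "#$InformationObject",
}
DISJOINT = {
    frozenset(("#$BiologicalLivingObject", "#$InformationObject")),
    frozenset(("#$ComputerProgram", "#$EMailMessage")),
}

def generalisations(c):
    """The chain of generalisations of c, including c itself."""
    out = []
    while c is not None:
        out.append(c)
        c = GENLS.get(c)
    return out

def regurgitate(term, collection):
    """Reject (isa term collection) if it clashes with what is already known."""
    known = KNOWN_ISA.get(term)
    return any(frozenset((a, b)) in DISJOINT
               for a in generalisations(known)
               for b in generalisations(collection))

print(regurgitate("#$CallumRoberts", "#$Research"))            # True: rejected
print(regurgitate("#$Insight-EMailClient", "#$EMailMessage"))  # True: rejected
```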
Ontological vs. Semantic Relatedness
These examples usefully highlight a clear difference between quantitative measures of semantic relatedness and an ontological relatedness derivable from a principled category structure.
Callum Roberts is a researcher, which is highly semantically related to research; Insight is an email client, which is highly semantically related to email messages.
Thematically these pairs are incredibly close, but ontologically they are very different kinds of thing.
In this way, Cyc rejected 4,300 assertions, roughly 3% of the total presented to it.
Manual analysis showed that of these, 96% were true negatives.
Future Building Work
Now that we have a distinction between semantic and ontological relatedness, combining the two has powerful possibilities. Indeed, in automated information science generally, overlapping independent heuristics are a boon to accuracy.
We plan to:
• Automatically augment Cyc’s disjointness network and semantic argument constraints on relations: systematically organized infobox relations are a natural ground from which to generalize argument constraints.
• Mine the Wikipedia category network – with caution – for further disjointness knowledge.
• Evaluate much more fully both this automated ontology-building effort and other current leaders in the field.
Philosophical Lessons
• Our project suggests that the notion of philosophical categories leads to measurable improvements in real-world ontology-building.
• Just how extensive a system of categories should be will require real-world testing. But now we have the tools and free user-supplied data to do this.
• Where exactly the line should be drawn between categories proper and mere properties remains open.
• However, modern statistical tools raise the possibility of a quantitative treatment of ontological relatedness that is more nuanced than Aristotle’s ten neat piles of predicates, yet can still recognize that it is highly problematic to say that the number 8 is red, and why.