Catherine Legg & Sam Sarjant University of Waikato 2 July, 2012 Bill Gates is not a Parking Meter: Philosophical Quality Control in Automated Ontology-building

Catherine Legg & Sam SarjantUniversity of Waikato

2 July, 2012

Bill Gates is not a Parking Meter: Philosophical Quality Control in Automated

Ontology-building

1. Philosophical Categories: What are they? What are they good for?

2. Automated Ontology-Building• Concept Mapping• Adding New Assertions

3. Philosophical Quality Control: Semantic vs. Ontological Relatedness

Agenda




Agenda

What is wrong with this statement?

S1: The number 8 is a very red number.

This seems somehow worse than false.

For a false statement can be negated to produce a truth, but:

S2: The number 8 is not a very red number.

doesn’t seem right either.

The problem seems to be that numbers are not the kind of thing that can have colours. If someone thinks so then they don’t understand what kinds of things numbers are.

Traditional philosophical terminology for what is wrong is that S1 commits a category mistake.

Fregean quibbles notwithstanding

The odd synaesthesia episode notwithstanding

It is well-known that the philosophical discipline of ontology was invented by Aristotle to deal with these kinds of problems (among others):

όντος: (being) +

λόγος: (theory of)

It was called by him first philosophy (i.e. the fundamental science). A key part of its role was to define categories.

Categories are intended to describe the most different kinds of things that exist in reality: Suggested examples include:

Physical objects

Times

Events Relationships

Numbers

Traditionally ontologies were built into a taxonomic structure, or ‘tree of knowledge’:

‘Tree of Porphyry’ (3rd century A.D.)

• There is a subtle but important distinction between philosophical categories and mere properties.

• Although both divide entities into groups, and may be represented by classes, categories provide a deeper, more sortal division which enforces constraints.

• E.g. while we know the same thing cannot be a colour and a number, the same cannot be said for green and square.

• However, at what ‘level’ of an ontology constraining categorical divisions give way to non-constraining property divisions is frequently unclear and contested.

• This has led to skepticism about philosophical categories.

• Plus the disdain of “speculative metaphysics” by C20th logical positivism (and various successors) didn’t help.

Category vs. Property Distinctions:




Agenda

• A wealth of new free user-supplied Web content (‘Web 2.0’) for raw data. (Blogs, tags, Wikipedia…)

• New automated methods of determining semantic relatedness (e.g. Milne & Witten, “An Effective, Low-cost measure of Semantic Relatedness Obtained from Wikipedia links”,.2008)

• Manual methods widely agreed to be too slow and labour-intensive (e.g. Cyc project, 25 years, still unfinished)

Why Automated Ontology-Building?

What Has Been Done So Far?• YAGO (Wordnet backbone, enriched with Wikipedia’s leaf categories.

Taxonomic)

• DBPedia (Wikipedia Infoboxes and other semi-structured data. Not taxonomic)

• EMLR (Wikipedia category network, enriched with further relations from, e.g., parsing category names, e.g. BornIn1954. Taxonomic)

• Use as backbone the Cyc taxonomy, because:

• Highly principled as the result of so many years’ labor

• Purpose-built inference engine to reason over the knowledge

• Now publically available ‘open source’ (ResearchCyc)

• One researcher (Legg) had inside knowledge of the system, which is rare worldwide

• Build onto it knowledge mined from Wikipedia, because:

• Access to Wikipedia-based automated measure of semantic relatedness developed at University of Waikato

• Wikipedia is astounding!

• 2.4M (English) articles, referred to by 3M different terms

• ~25 hyperlinks per article

• 175K templates for semi-structured data-entry (inc.9K ‘infoboxes’)

• full editing history for every article….etc, etc.

Our Choice

Cyc Ontology & Knowledge Base

OpenCyc contains:~13,500 Predicates~200,000 Concepts~3,000,000 Assertions

Represented in:• First Order Logic• Higher Order

Logic• Context Logic

(Micro-theories)

Thing

IntangibleThing Individual

TemporalThing

SpatialThing

PartiallyTangible

ThingPaths

SetsRelations

LogicMath

HumanArtifacts

SocialRelations,

Culture

HumanAnatomy &Physiology

EmotionPerception

Belief

HumanBehavior &

Actions

ProductsDevices

ConceptualWorks

VehiclesBuildingsWeapons

Mechanical& Electrical

Devices

SoftwareLiterature

Works of ArtLanguage

AgentOrganizations

OrganizationalActions

OrganizationalPlans

Types ofOrganizations

HumanOrganizations

NationsGovernmentsGeo-Politics

Business, Military

Organizations

Law

Business &Commerce

PoliticsWarfare

ProfessionsOccupations

PurchasingShopping

TravelCommunication

Transportation& Logistics

SocialActivities

EverydayLiving

SportsRecreation

Entertainment

Artifacts

Movement

State ChangeDynamics

MaterialsParts

Statics

PhysicalAgents

BordersGeometry

EventsScripts

SpatialPaths

ActorsActions

PlansGoals

Time

Agents

Space

PhysicalObjects

HumanBeings

Organ-ization

HumanActivities

LivingThings

SocialBehavior

LifeForms

Animals

Plants

Ecology

NaturalGeography

Earth &Solar System

PoliticalGeography

Weather

Domain-Specific Knowledge

(e.g., Bio-Warfare, Terrorism, Computer Security, Military Tactics, Command & Control, Health Care, …)

Domain-Specific Facts and Data

Cyc Common-Sense Knowledge

1. Semantic Argument Constraints on Relations

Cyc contains many assertions of the form: (arg1Isa birthDate Animal)(arg2Isa capital City)

These represent that only animals have birthdays, and that the capital of a country must be a city.

These features of Cyc are a form of categorical knowledge.Although some of the categories invoked might seem

relatively specific and trivial compared to Aristotle’s, logically the constraining process is the same.

Cyc enforces these constraints at knowledge entry, a notable difference from every other formal ontology its size.

Cyc Common-Sense Knowledge

2. Disjointness Assertions Between Collections

ResearchCyc currently contains 6000 explicitly asserted disjointWith claims, for e.g.:

disjointWith Doorway WindowPortal

disjointWith HomogeneousStructure Ecosystem

disjointWith YardWork ShootingAProjectileWeapon (!)

From these countless further claims can be deduced.

Again, Cyc enforces these constraints at knowledge entry,

Wordnet knows about sibling relationships, so in some sense it knows that a cat cannot be a dog. But it cannot ramify this knowledge through its hierarchy in this way.

Cyc reasoning over its common-sense knowledge:

Never explicitly asserted into Cyc (!)

N.B.

Key assertion

Wikipedia as an ontology

articles basic concepts infoboxes facts about those concepts first sentences concept definitions, often in

standard format hyperlinks between articles ‘semantic

relatedness’ between concepts categories organise articles into conceptual

groupings. Though these groupings are far from a principled taxonomy enabling knowledge inheritance

For example, consider the following category….

!!

!




Agenda

Exact mappings

CityOfSaarbrucken

Saarbrucken Saarbrucken

Saarbrücken

Get synonyms asserted in Cyc,

e.g. using #$nameString

matches

Use Wikipedia redirects

via Cyc synonymsStage A&B: easy 1-1 matches using title

strings (or synonyms)

Stage C: a number of possible candidates on the Wikipedia side, semantic disambiguation required

Collect all candidates (e.g. Kiwi: Bird? Fruit? Nationality? ) Compute commonness of each candidate (how often string

is anchor text to their Wikipedia pages)Collect context from Cyc (concepts nearby in taxonomy)Compute similarity to context (using Wikipedia hyperlinks:

Milne and Witten, 2008)Determine best candidate

Stage C: Many Wikipedia 1 Cyc. Semantic disambiguation required

Stage D: Many Cyc 1 Wikipedia. Reverse disambiguation required.

In this stage many candidate mappings were eliminated by mapping back from the Wikipedia article to the Cyc term, discarding mappings which don’t ‘map back’.

For example, Cyc term #$DirectorOfOrganisation incorrectly maps to Film director, but when we attempt to find a Cyc term from Film director we get #$Director-Film.

This reduced the number of mappings by 43%, but increased precision considerably.




Agenda

First we found mapped concepts where the Wikipedia article had an equivalent category (about 20% of mapped concepts). E.g. the article Israeli Settlement has an equivalent category Israeli Settlements:

Finding new concepts in Wikipedia and adding them to Cyc

We then mined this category for new concepts belonging under the mapped Cyc concept, according to the Cyc taxonomy. For instance:

We then mined this category for new concepts belonging under the mapped Cyc concept, according to the Cyc taxonomy. For instance:

We called these ‘true children’.

We identified true children by:

1.Parsing the first sentences of Wikipedia articles (with a list of regular expressions):

Havat Gilad (Hebrew: ל:ע8ד<= =@ת ג lit. Gilad Farm) is an Israeli settlement ,ח@וoutpost in the West Bank.

Netiv HaGdud (Hebrew: דוד Fג Hיב ה Lת Fנ, lit. Path of the Battalion) is a moshav and Israeli settlement in the West Bank.

Kfar Eldad (Hebrew: כפר אלדד) is an Israeli settlement and a Communal settlement in the Gush Etzion Regional Council, south of Jerusalem.

The Yad La’achim operation (Hebrew: מבצע יד לאחים, “Giving hand to brothers") was an operation that the IDF performed during the disengagement plan.

2. “Infobox pairing” – If an article in the category shares an infobox template with 90% of true children, we include it.

FAILS FIRST SENTENCE TEST

2. “Infobox pairing” – If an article in the category shares an infobox template with 90% of true children, we include it.

FAILS FIRST SENTENCE TEST

Final Results

Method Cyc terms Percent mapped

Total terms available 163,000

Common sense terms 83,900

Exact (1-1) mappings 33,500 40%

Further mappings after disambiguation(2 ways)

8,800 10%

Total mapped 42,300 50%

Method Cyc assertions Percent growth

New Cyc Concepts 35, 000 30%

New “doc. strings” 17, 000

Other new assertions 228, 000 10%

Evaluation (22 human volunteers, online form, compared with DBPedia Ontology)

CASE: 1 2 3 4 5 6_ DBpedia children 0.58 0.81 0.99 0.98 0.99 0.99 Our New children 0.57 0.88 0.99 0.90 0.90 1.00

2008 mappings 0.65 0.83 0.99 0.99 0.99 1.00 2009 mappings 0.68 0.91 1.00 1.00 1.00 1.00CASES:1 : 100% of evaluators thought assignment correct2 : >50% thought assignment correct3 : At least 1 thought assignment correct4 : 100% thought assignment correct or close5 : >50% thought assignment correct or close6 : At least 1 thought assignment correct or close




Agenda

Due to its common-sense knowledge, Cyc ‘regurgitated’ many assertions fed to it which were ontologically incorrect.

Examples of rejected assertions:

(#$isa #$CallumRoberts #$Research)

Professor Callum Roberts is a marine conservation biologist, oceanographer, author and research scholar in the Environment Department of the University of York in England.

(#$isa #$Insight-EMailClient #$EMailMessage)

Insight WebClient is an groupware E-Mail client from Bynari embedded on Arachne Web Browser for DOS.

Why it happened

Why it happened

Quality control is provided via Cyc’s common-sense knowledge, as Cyc knows enough now to ‘regurgitate’ many assertions which are ontologically incorrect.

Examples of regurgitated assertions:





Why it happened

Why it happened

Here Cyc knows that the collection of biological living objects is disjoint with the collection of information objects.

Quality control is provided via Cyc’s common-sense knowledge, as Cyc knows enough now to ‘regurgitate’ many assertions which are ontologically incorrect.

Examples of regurgitated assertions:





Why it happened

Why it happened

Here Cyc knows that the collection of software programs is disjoint with the collection of letters.

Here Cyc knows that the collection of biological living objects is disjoint with the collection of information objects.

Ontological vs Semantic RelatednessThese examples usefully highlight a clear difference

between quantitative measures of semantic relatedness, and an ontological relatedness derivable from a principled category structure.

Callum Roberts is a researcher, which is highly semantically related to research and Insight is an email client, which is highly semantically related to email messages.

Thematically these pairs are incredibly close, but ontologically, they are very different kinds of thing

In this way, Cyc rejected 4 300 assertions, roughly 3% of the total presented to it.

Manual analysis showed that of these, 96% were true negatives.

Future building workNow that we have a distinction between semantic and

ontological relatedness, combining the two has powerful possibilities.

In fact in general in automated information science, overlapping independent heuristics are a boon to accuracy.

We plan to: automatically augment Cyc’s disjointness network and semantic argument constraints on relations:systematically organized infobox relations are a natural

ground to generalize argument constraints. The Wikipedia category network will be mined – with

caution - for further disjointness knowledge. Evaluate much more fully both this automated ontology-

building effort and other current leaders in the field

Philosophical lessons• Our project suggests the notion of philosophical categories

leads to measurable improvements in real-world ontology-building.

• Just how extensive a system of categories should be will require real-world testing. But now we have the tools and free user-supplied data to do this.

• Where exactly the line should be drawn between categories proper and mere properties remains open.

• However, modern statistical tools raise the possibility of a quantitative treatment of ontological relatedness that is more nuanced than Aristotle’s ten neat piles of predicates, yet can still recognize that it is highly problematic to say that the number 8 is red, and why.

[email protected]

[email protected]

Documents

Catherine Legg & Sam Sarjant University of Waikato 2 July, 2012 Bill Gates is not a Parking Meter: Philosophical Quality Control in Automated Ontology-building