Dr. Douglas B. Lenat
, 7718 Wood Hollow Drive, Austin, TX 78731 Email: [email protected] Phone: 512-342-4001
Computers versus Common Sense
• CYC: Software that partially understands
• The impact of that on the AIM dream
May 4, 2009
ELIZA (DOCTOR)
1965 Joe Weizenbaum, MIT
• Carl Rogers advocated reflection
Patient: “I swear a lot.”
“How do you feel about the fact that you swear a lot?”
• Counted a lot on anthropomorphism
• Count on patient-to-doctor respect/awe/…
Patient: “My dog’s mother died recently.”
“Tell me more about your mother.”
Veneer of intelligence, not intelligence
ALICE (40 years post-Eliza)
• JUDGE11[9:26:0]I WILL ASK YOU SOME SIMPLE QUESTIONS AND GIVE SIMPLE COMMANDS
• PROGRAM[9:26:0]Do you think your plan will succeed?
• JUDGE11[9:26:16]WHAT IS THE COLOR OF A BLUE CAR
• PROGRAM[9:26:16]WHAT COLOR IS a blue car
• JUDGE11[9:26:35]WHERE IS SUE’S NOSE WHEN SUE IS IN HER HOUSE
• PROGRAM[9:26:35]Where it belongs. Try searching the Web.
2009: Software is still Brittle
“How old was Martin van Buren when he was elected President of the U.S.?”
“Is the Space Needle taller than the Eiffel Tower?”
Natural Language Understanding requires having lots of knowledge
1. The pen is in the box.
The box is in the pen.
2. The police watched the demonstrators because they feared violence.
The police watched the demonstrators because they advocated violence.
3. Mary and Sue are sisters.
Mary and Sue are mothers.
4. Every American has a mother.
Every American has a president.
5. John saw his brother skiing on TV. The fool didn’t have a coat on!
John saw his brother skiing on TV. The fool didn’t recognize him!
7. “…include all the re-do CABG procedures utilizing ITA and SVG in 1991”.
“And” usually does mean “and”. But in this query, “and” really must mean “or”. Medical knowledge, not grammar, disambiguates this: a single CABG will not have both an ITA and a SVG.
8. “…that the tumor cells are stopping dividing or dying…”
Do they mean “stopping dividing or stopping dying”? Of course not, but in 16 of 30 randomly selected syntactically similar constructions from www.clinicaltrials.gov, the coordination (i.e., the wider scope of the modifier, in this case the word “stopping”) was the intended meaning. In each case, only one choice “makes sense” (is consistent with medical knowledge and common sense).
9. “Adult patients who underwent MAZE III with or without Mitral Valve Repair or Replacements.”
Is the second half of that query just a waste of space? Discourse pragmatics says no, the physician must have had some reason for saying that. Medical knowledge provides a plausible interpretation: “Adult patients who underwent MAZE III with no concomitant procedures other than Mitral Valve Repair or Replacements”
The basic idea:
Get the computer to understand, not just store, information. Then it can reason to answer your queries.
Okay, so let’s tell the computer the same sorts of things that human beings know about cars, colors, heights, movies, time, driving to a place, etc.: all the other stuff that everybody knows.
MicrowaveOven is a type of Kitchen-Appliance
Dishwasher is a type of Kitchen-Appliance
Rthagide-disjaks is a type of Kitchen-Appliance
Gracinimumples is a type of Kitchen-Appliance
Rthagide-disjaks alorxes Vorawnistz.
Gracinimumples alorxes Vorawnistz and Buzqa.
Buzqa is a Thwarn and supplied through Epluns.
You can’t use X if it alorxes Y but lacks any Y
Eventually, after writing millions of these rules, the system knows as much about pipes, liquids, water, electricity, microwave ovens, dishwashers, cars, colors, movies, heights, etc. as you and I do.
Ultimately, there is just 1 interpretation of that model, and it corresponds to the real world.
Long before that, incrementally, the system gains competence and trustworthiness
Cyc is…
– The typical bird has 1 beak, 1 heart, lots of feathers,…
– Hearts are internal organs; feathers are external protrusions
– Most vehicles are steered by an awake, sane, adult,… human
– Tangible objects can’t be in 2 (disjoint) places at once
– Badly injuring a child is much worse than killing a dog
– Causes temporally precede (i.e., start before) their effects
– A stabbing requires 2 cotemporal and proximate actors
– etc.
Millions of facts, rules of thumb, etc. that capture human common sense about our everyday world
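- Each of these represented in formal logic
- Info. about a set of hundreds of thousands of terms
- Language-independent
[Figure: words from several languages linked to language-independent Cyc terms. Word labels: EnglishWord-Pen, EnglishWord-Plume, FrenchWord-Plume, ChineseWordForWritingPen, … Concept labels: WritingPen, BirdFeather, Penitentiary, Authoring.]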
• An inference engine that produces the same sorts of inferences from those that people would.
• Interfaces so the system can communicate with people, data bases, spreadsheets, websites, etc.
What Needs to be Shared?
• bits/bytes/streams/network…
• alphabet, special characters,…
• words, morphological variants,…
• syntactic meta-level markups (HTML)
• semantic meta-level markups (SGML, XML)
• content (logical representation of doc/page/...)
• context (common sense, recent utterances, and n dimensions of metadata: time, space, level of granularity, the source’s purpose, etc.)
Sem.Web
How formalized knowledge helps search
• Query: “Someone smiling”
• Caption: “A man helping his daughter take her first step”
Find information by inference (+KB):
When you become happy, you smile.
You become happy when someone you love accomplishes a milestone.
Taking one’s first step is a milestone.
Parents love their children.
(ForAll ?P (ForAll ?C (implies (and (isa ?P Person) (children ?P ?C)) (loves ?P ?C))))
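The chain from caption to query can be sketched with a toy forward-chainer in Python. This is only an illustration of the idea, not Cyc's inference engine: the tuple encoding and the predicate names `firstStep`, `accomplishesMilestone`, `happy`, and `smiling` are assumptions made for this sketch; only `children` and `loves` come from the formula on the slide.

```python
# Toy forward chaining over ground facts (illustrative only; not Cyc's engine).
# Facts are tuples: (predicate, arg1, ...). Predicate names other than
# `children` and `loves` are hypothetical.

def forward_chain(facts):
    """Apply the four common-sense rules until no new fact is derived."""
    facts = set(facts)
    while True:
        new = set()
        for f in facts:
            # Parents love their children.
            if f[0] == "children":
                new.add(("loves", f[1], f[2]))
            # Taking one's first step is a milestone.
            if f[0] == "firstStep":
                new.add(("accomplishesMilestone", f[1]))
            # When you become happy, you smile.
            if f[0] == "happy":
                new.add(("smiling", f[1]))
        # You become happy when someone you love accomplishes a milestone.
        for f in facts:
            if f[0] == "loves" and ("accomplishesMilestone", f[2]) in facts:
                new.add(("happy", f[1]))
        if new <= facts:
            return facts
        facts |= new

# Facts a parser might extract from the caption
# "A man helping his daughter take her first step" (hypothetical constants).
caption_facts = {("children", "Man1", "Daughter1"), ("firstStep", "Daughter1")}
derived = forward_chain(caption_facts)

# The query "Someone smiling" matches if any smiling fact was derived.
matches_query = any(f[0] == "smiling" for f in derived)
```

Nothing in the caption mentions smiling; the match exists only because the four background rules connect the caption's facts to the query.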
How formalized knowledge helps search
Query: “Show me pictures of strong and adventurous people”
Caption: “A man climbing a rock face”
Find information by inference (+KB)
How formalized knowledge helps search
Text Document
Query: “Government buildings damaged in terrorist events in Beirut between 1990 and 2001”
Document: “1993 pipe bombing of France’s embassy in Lebanon.”
Find information by inference (+KB)
How can our programs be intelligent, not merely have the veneer of it?
• ANSWER: By having a large corpus of knowledge, spanning the gamut from specific domain-dependent all the way up to general common sense.
• The computer needs to be able to apply the knowledge, not just store some English gloss
– Represent it formally (predicate calculus), and apply logic
– Represent it numerically, and apply mathematics/statistics
• And after all that: Be compelling to the human deciding
• Magic tricks
– “How do they do that?!” “How was I ever fooled by that?!”
• Efficacy of punishment vs reward
– “Punishment is more effective, and the statistics back me up”
• Clinical decision-making (by doctors and by patients)
– “Because 0.814” versus “Because < plausible causal rationale >”
• Organ donation in European countries:
– Why is it so often 15%/85% or 85%/15% ?
[Answer: Because when you apply for a driver’s license in some countries, you have to check a box to “opt in”; in others, you have to check a box to “opt out”; and in the U.S. and most European countries at least, 85% of the people don’t know what they should do, even though it’s an emotional, serious choice, and end up just leaving it unchecked.]
One Good Explanation is worth 20 points of IQ
Reflection Framing Effect
Philadelphia is preparing for a Legionnaires’ disease outbreak expected to kill 600 people today. Two alternative programs to combat the disease have been proposed. The consequences of each program are as follows:
If Program A is adopted, 200 people will be saved. (72%)
If Program B is adopted, there is a 1/3 chance that all 600 will be saved, and a 2/3 chance that no lives will be saved. (28%)
If Program A’ is adopted, 400 people will die. (22%)
If Program B’ is adopted, there is a 2/3 chance that 600 will die, and a 1/3 chance that no one will die. (78%)
Program A = Program A’
Program B = Program B’
For more information, see: Kahneman, D. and Tversky, A. (1984). Choices, values, and frames. American Psychologist, 39, 341-350.
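A short worked calculation shows why the two framings describe the same choices. Here $S$ is the number of lives saved out of the 600 at risk; the symbol $S$ is introduced only for this illustration.

```latex
\begin{align*}
\text{Program A:}  \quad & S = 200 \\
\text{Program A':} \quad & S = 600 - 400 = 200 \\
\text{Program B:}  \quad & E[S] = \tfrac{1}{3}\cdot 600 + \tfrac{2}{3}\cdot 0 = 200 \\
\text{Program B':} \quad & E[S] = 600 - \left(\tfrac{2}{3}\cdot 600 + \tfrac{1}{3}\cdot 0\right) = 200
\end{align*}
```

A and A' are the same certain outcome, and B and B' are the same gamble, yet the share of respondents choosing the certain option drops from 72% to 22% when the wording changes from lives saved to deaths.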
Conjunction Fallacy
A health survey was conducted in a representative sample of adult males in Chicago of all ages and occupations. Mr. F was included in the sample. He was selected by random chance from the list of participants.
Please rank the following statements in terms of which is most likely to be true of Mr. F. (1=most likely to be true, 6=least likely)
____ Mr. F smokes more than 1 cigarette per day on average.
____ Mr. F has had one or more heart attacks. [A]
____ Mr. F had a flu shot this year.
____ Mr. F eats red meat at least once per week.
____ Mr. F has had one or more heart attacks and he is over 55 years old. [A and B]
____ Mr. F never flosses his teeth.
For more information, see: Tversky, A. and Kahneman, D. (1983). Extensional vs. intuitive reasoning: The conjunction fallacy in probability judgment. Psych.Rev. 90, 293-315.
58% rated “A and B” more likely than A
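The ranking is a fallacy because a conjunction can never be more probable than either of its conjuncts. With $A$ = "has had one or more heart attacks" and $B$ = "is over 55 years old":

```latex
P(A \wedge B) \;=\; P(A)\,P(B \mid A) \;\le\; P(A),
\qquad \text{since } 0 \le P(B \mid A) \le 1 .
```

So whatever the true rates are in the sample, "A and B" cannot correctly be ranked as more likely than "A" alone.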
Why there is a need for meta-logical elements (rationale and POV) to convince decision-makers
• Early hominids: pre-rational decision-makers
• Later hominids: usually rational
• Even later hominids: almost always rational
A 67 year old woman suffering from ICM with elevated bilirubin, history of diabetes, body mass index of 39.5, NYHA function class III, mitral valve regurgitation grade (MVRG) of 2+, and no aortic valve regurgitation (AVR) is assigned to CABG surgery. RF+Cyc is consulted and the RF (random forest statistical reasoning) component, having been trained on a large database, identifies CABG alone as the most likely treatment option, citing an odds ratio of 2.6 over the next most favorable treatment, CABG+MVA. As rationale, the Cyc (AI) component observes that the low MVRG is atypical of MVA which is a surgical procedure typically reserved for patients with severe mitral regurgitation and thus the simpler CABG procedure is preferred. However, an intraoperative transesophageal echocardiogram (TEE) suggests MVRG is 3+. Based on this, the surgical team overrides the initial diagnosis without consultation, opting instead for CABG+MVA. The patient dies 3 days later from complications due to surgery.
In this setting, RF+Cyc, if consulted, could have alerted the heart team to additional data that might have swayed their decision, thus potentially saving a life. RF+Cyc would have noted that while an MVRG of 3+ is consistent with CABG+MVA, the odds favoring CABG only marginally decrease from 2.6:1 to 1.7:1 when MVRG is upstaged for this patient from 2+ to 3+, and that surgery under CABG alone offers a 20% increase in median survival compared to CABG+MVA. RF+Cyc could further argue that intraoperative MVRG can falsely appear to be upstaged due to altered hemodynamics in anesthetized patients. A Cyc-assisted semantic search of the recent literature reveals that transthoracic echocardiograms (TTE) more reliably reflect the degree of mitral regurgitation than TEE. That (+co-morbidities) argues for just CABG.
4 Pitfalls of Semantic Technology
• Ignorance-based: A small theory size (#terms, instances, rules)
• Static KB (massively tuned, optimized, cached ahead of time)
• Simple assertions (SAT constraints; propositional calculus; Horn clause logic; Description Logic; first order logic)
• 1 global context (no contradictions, tiny domain, simplified world)
Applying Cyc
• Cyc is a power source, not a single application. Like oil, electricity, telephony, computers,… Cyc can spawn and sustain a knowledge utility industry.
• It can cost-effectively underlie almost all apps. (Provide a common-sense layer to reduce brittleness when faced with unexpected inputs/situations)
• To apply Cyc, we extend its ontology, its KB, and possibly its suite of specialized reasoning modules
“What sequences of events could lead to the destruction of Hoover Dam?”
“Were there any attacks on targets of symbolic value to Muslims since 1987 on a Christian holy day?”
[Figure: architecture of AKB, The Analyst’s Knowledge Base. Labels: CT Analyst; Domain Experts; Cycorp Tools For: Ontology-Building, -Browsing, -Editing, & Fact/Rule Entry; Query Formulation / Query Formulator; Scenario Generation / Scenario Generator; Explanation Generation / Explanation Generator; Cyc Reasoning Modules; General Knowledge; Terrorism Knowledge Base; Others’/GOTS Analysis and Collaboration Components; Relational DB “projection” of the AKB; OWL; Interface to Data Repositories: Border Crossings, HID Observations, Travel Records, Credit Card Records, Geopolitical Data, Global Terrain Data, Weather Data, Satellite Intel, HUMINT Messages, INS Data, Military Intel, SIGINT Message Content, output of COTS Text Extraction Systems.]
A more recent example
“What major US cities are particularly vulnerable to an anthrax attack?”
The answer is logically implied by data dispersed through several sources:
[Figure: the sources involved. Labels: USGS GNIS DB; AMVA KB; RAND R; UNFAO DB; DTRA CATS DB.]
“What major US cities are particularly vulnerable to an anthrax attack?”
“major US city”: ?C is a U.S. City with >1M population
“particularly vulnerable to an anthrax attack”:
– the current ambient temperature at ?C is above freezing, and
– ?C has more than 100 people for each hospital bed, and
– the number of anthrax host animals near ?C exceeds 100k
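Once the English phrases are pinned down this way, the query is a conjunction of simple tests. The Python sketch below shows that decomposition under stated assumptions: the record type, its field names, and every sample value are made up for illustration, and all records are assumed to be U.S. cities already. In the talk, the inputs come from several separate sources rather than one table.

```python
from dataclasses import dataclass

@dataclass
class CityFacts:
    """Hypothetical record combining facts that really live in separate sources."""
    name: str
    population: int            # inhabitants
    temperature_c: float       # current ambient temperature, degrees Celsius
    hospital_beds: int
    host_animals_nearby: int   # anthrax host animals near the city

def is_major_us_city(c: CityFacts) -> bool:
    # "major US city": more than 1M population (U.S. membership is assumed here).
    return c.population > 1_000_000

def is_particularly_vulnerable(c: CityFacts) -> bool:
    # All three conditions from the slide must hold together.
    return (c.temperature_c > 0.0                        # above freezing
            and c.population / c.hospital_beds > 100     # >100 people per bed
            and c.host_animals_nearby > 100_000)         # >100k host animals

def answer(cities):
    return [c.name for c in cities
            if is_major_us_city(c) and is_particularly_vulnerable(c)]

# Made-up records, purely to exercise each condition.
cities = [
    CityFacts("CityA", 1_500_000, 12.0, 10_000, 250_000),  # 150 people/bed: qualifies
    CityFacts("CityB", 2_000_000, -5.0, 10_000, 250_000),  # below freezing
    CityFacts("CityC", 800_000, 20.0, 5_000, 300_000),     # under 1M population
    CityFacts("CityD", 1_200_000, 15.0, 20_000, 500_000),  # only 60 people/bed
]
result = answer(cities)
```

The hard part, which this sketch skips, is knowing that each field means what the query needs it to mean in each source.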
The Geographic Names Information System (GNIS) DB maintained by the US Geological Survey (USGS).
USGS GNIS DB
 state | name              | type  | county     | state_fips | primary_lat | primary_long | elevation | population | status
-------+-------------------+-------+------------+------------+-------------+--------------+-----------+------------+--------------------
 TX    | Dallas            | ppl   | Dallas     | 48         | 32.78333    | -96.8        | 463       | 1022830    | BGN 1978 1959
 MN    | Hennepin County   | civil | Hennepin   | 27         | 45.01667    | -93.45       | 0         | 1032431    |
 CA    | Sacramento County | civil | Sacramento | 6          | 38.46667    | -121.31667   | 0         | 1041219    |
 AZ    | Phoenix           | ppl   | Maricopa   | 4          | 33.44833    | -112.07333   | 1072      | 1048949    | BGN 1931 1900 1897
So how do we explain to our system that:
• row 1 of that table is “about” the city of Dallas, TX
• the population field of that table contains the number of inhabitants of the city that that row is “about”
• here is exactly how to access tuples of that database
• that access will be fast, accurate, recent, complete
• the population field of that table contains the number of inhabitants of the city that that row is “about”
We provide the field encodings and decodings, some of which correspond to explicit fields like population, two-letter state codes, etc:
(fieldDecoding Usgs-Gnis-LS ?x (TheFieldCalled "population") (numberOfInhabitants (TheReferentOfTheRow Usgs-Gnis) ?x))
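The effect of such a decoding can be sketched in Python: a raw database row goes in, and a logical assertion about the thing the row describes comes out. The function name `decode_population` and the `CityOf-…` naming scheme for the row's referent are hypothetical; the row values are those of row 1 of the GNIS table.

```python
def decode_population(row):
    """Turn the `population` field of one GNIS row into a logical assertion.

    Mirrors the intent of the fieldDecoding rule: the value ?x in the field
    called "population" is the numberOfInhabitants of whatever the row is about.
    """
    # Hypothetical naming scheme for the term that the row is "about".
    referent = f"CityOf-{row['name']}-{row['state']}"
    return ("numberOfInhabitants", referent, int(row["population"]))

# Row 1 of the table, as a database driver might return it (strings throughout).
row = {"state": "TX", "name": "Dallas", "type": "ppl",
       "county": "Dallas", "population": "1022830"}
assertion = decode_population(row)
```

The point of stating the mapping declaratively, as the slide does, is that the reasoner can then treat the table as if it were a set of ordinary assertions.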
• how to access tuples of that database
We provide all the information needed for a JDBC connection script:
We assert, in the context (MappingMtFn Usgs-KS), all of these:
(passwordForSKS Usgs-KS "geografy")
(portNumberForSKS Usgs-KS 4032)
(serverOfSKS Usgs-KS "sksi.cyc.com")
(sqlProgramForSKS Usgs-KS PostgreSQL)
(structuredKnowledgeSourceName Usgs-KS "usgs")
(subProtocolForSKS Usgs-KS "postgresql")
(userNameForSKS "sksi")
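These assertions hold enough to assemble a connection string. The Python sketch below builds one in the usual PostgreSQL JDBC URL shape, `jdbc:postgresql://host:port/database`. It rests on one labeled assumption: that the knowledge-source name ("usgs") is also the database name. The dictionary layout and function name are illustrative, and the password is left out on purpose.

```python
# Values copied from the assertions about Usgs-KS; the dict layout is illustrative.
sks = {
    "serverOfSKS": "sksi.cyc.com",
    "portNumberForSKS": 4032,
    "subProtocolForSKS": "postgresql",
    "structuredKnowledgeSourceName": "usgs",
    "userNameForSKS": "sksi",
}

def jdbc_url(a):
    """Assemble a JDBC URL of the form jdbc:<subprotocol>://<host>:<port>/<database>.

    Assumption: the knowledge-source name doubles as the database name.
    """
    return (f"jdbc:{a['subProtocolForSKS']}://{a['serverOfSKS']}"
            f":{a['portNumberForSKS']}/{a['structuredKnowledgeSourceName']}")

url = jdbc_url(sks)
```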
• that access will be fast, accurate, recent, complete
We provide meta-level assertions about the database, about each table of the database, about the completeness etc. of various kinds of data in the DB, etc.
We assert, in the context (MappingMtFn Usgs-KS):
(schemaCompleteExtentKnownForValueTypeInArg Usgs-Gnis-LS USCity numberOfInhabitants 1)
(resultSetCardinality Usgs-Gnis-PS (TheSet (PhysicalFieldFn Usgs-Gnis-PS "state")) TheEmptySet 60.0)
(resultSetCardinality Usgs-Gnis-PS (TheSet (PhysicalFieldFn Usgs-Gnis-PS "primary_long") (PhysicalFieldFn Usgs-Gnis-PS "primary_lat") (PhysicalFieldFn Usgs-Gnis-PS "name")) (TheSet (PhysicalFieldFn Usgs-Gnis-PS "county") (PhysicalFieldFn Usgs-Gnis-PS "state")) 530.36)
“What major US cities are particularly vulnerable to an anthrax attack?”
“major US city”: U.S. City with >1M population
“particularly vulnerable to an anthrax attack”:
– the current ambient temperature at ?C is above freezing, and
– ?C has more than 100 people for each hospital bed, and
– the number of anthrax host animals near ?C exceeds 100k
Cyc knows that pullets are chickens, so don’t add those two numbers together!
“In what countries bordering Pakistan are there members of the ANVC?”
Even simple queries often require 1-4 reasoning steps
Each answer that CAE finds for this generally involves a 1-4-step (not 0-step) argument (reasoning chain):
E.g., for the answer “India”, the justification is:
• According to the web site ‘Inside Terrorism’, the ANVC’s headquarters has been in Garo Hills, India from the beginning of January, 1996 through today.
• If an organization’s HQ is in place x, then there are members of that organization in place x.
• If someone is in place x, they are in every super-region of x.
• India borders Pakistan.
(Don’t include Prior & Tacit Knowledge)
The Cyc Knowledge Base
[Figure: map of the topic areas covered by the Cyc KB. Labels: Thing; Intangible Thing; Individual; Temporal Thing; Spatial Thing; Partially Tangible Thing; Paths; Sets, Relations; Logic, Math; Time; Space; Events, Scripts; Spatial Paths; Borders, Geometry; Agents; Actors, Actions; Plans, Goals; Physical Objects; Physical Agents; Movement; State Change, Dynamics; Materials, Parts, Statics; Artifacts; Living Things; Life Forms; Animals; Plants; Ecology; Natural Geography; Earth & Solar System; Political Geography; Weather; Human Beings; Human Anatomy & Physiology; Emotion, Perception, Belief; Human Behavior & Actions; Human Activities; Human Artifacts; Products, Devices; Conceptual Works; Vehicles, Buildings, Weapons; Mechanical & Electrical Devices; Software, Literature, Works of Art; Language; Organization; Agent Organizations; Organizational Actions; Organizational Plans; Types of Organizations; Human Organizations; Nations, Governments, Geo-Politics; Business, Military Organizations; Law; Business & Commerce; Politics, Warfare; Professions, Occupations; Purchasing, Shopping; Travel, Communication; Transportation & Logistics; Social Relations, Culture; Social Activities; Social Behavior; Everyday Living; Sports, Recreation, Entertainment. Layer labels: General Knowledge about Various Domains; Specific data, facts, and observations.]
Cyc contains:
15,000 Predicates
500,000 Concepts
5,200,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic
• Micro-theories
These numbers are not a good way to really get a handle on the Cyc KB
The Cyc Knowledge Base
“Is any seagull also a moose?”
If Cyc knows 10,000 kinds of animals, it should be able to answer 100,000,000 queries like that.
Option 1: Add those 100M assertions to the KB
Option 2: Add 50M disjointWith assertions instead
Option 3: Add about 10k Linnaean taxonomy assertions to the KB, plus one extra assertion: (isa BiologicalTaxon SiblingDisjointCollectionType)
If taxons A and B are not explicitly known (via those 10k assertions) to be in a subset/superset relationship, then assume that they are disjoint.
A few hundred such SiblingDisjoint assertions take the place of over 6 billion disjointness ones…which in turn take the place of 100 trillion ones like this: (not (isa Cher Moose))
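The arithmetic and the default rule behind Option 3 can be sketched in Python. The four-link taxonomy is a tiny hypothetical fragment, and `disjoint` is a simplified stand-in for what SiblingDisjointCollectionType licenses, not Cyc's implementation.

```python
from math import comb

# Explicit subset links (child -> parent): a tiny hypothetical taxonomy fragment.
PARENT = {
    "Seagull": "Bird",
    "Moose": "Mammal",
    "Bird": "Animal",
    "Mammal": "Animal",
}

def ancestors(taxon):
    """Return the taxon itself plus everything above it in the taxonomy."""
    seen = {taxon}
    while taxon in PARENT:
        taxon = PARENT[taxon]
        seen.add(taxon)
    return seen

def disjoint(a, b):
    """Sibling-disjoint default: two taxons are assumed disjoint unless one
    is explicitly known to subsume the other."""
    return a not in ancestors(b) and b not in ancestors(a)

# Counting the alternatives for 10,000 kinds of animals:
ordered_pairs = 10_000 ** 2         # Option 1: one answer per "is any X also a Y?" query
unordered_pairs = comb(10_000, 2)   # Option 2: about 50M pairwise disjointWith assertions
```

For example, `disjoint("Seagull", "Moose")` is true with no assertion that mentions both terms, while `disjoint("Seagull", "Bird")` is false because the subset link is explicit.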
E.g., Cyc’s 5M axioms are divided into thousands of contexts by:
granularity, topic, culture, geospatial place, time,...
There is no one correct monolithic ontology.
There is a correct monolithic reasoning mechanism, but it is so deadly slow that we never call on it unless we have to
E.g., the Cyc inference engine is a community of 1000 “agents” that attack every problem and, recursively, every subproblem (subgoal). One of these 1000 is a general theorem prover; the others have special-purpose data structures/algorithms to handle the most important, most common cases, very fast.
What factors argue <for/against> the conclusion that <ETA> <performed> <the March 2004 Madrid attacks>?
For:
- ETA often executes attacks near national elections
- ETA has performed multi-target coordinated attacks
- Over the past 30 years, ETA performed 75% of all terrorist attacks in Spain
- Over the past 30 years, 98% of all terrorist attacks in Spain were performed by Spain-based groups, and ETA is a Spain-based group.
Against:
- ETA warns (a few minutes ahead of time) of attacks that would result in a high number of civilian casualties, to prevent them. There was no such warning prior to this attack.
- ETA generally takes responsibility for its attacks, and it did not do so this time.
- ETA has never been known to falsely deny responsibility for an attack, and it did deny responsibility for this attack.
Building Cyc qua Engineering Task
[Figure: rate of learning plotted against amount known, up to the frontier of human knowledge. Curve labels: codify & enter each piece of knowledge, by hand; learning via natural language; learning by discovery. Time markers: 1984, 2004, today.]
CYC: 900 person-years; 23 realtime years; $90 million
Temporal Relations
37 Relations Between Temporal Things
temporalBoundsIntersect
temporallyIntersects
startsAfterStartingOf
endsAfterEndingOf
startingDate
temporallyContains
temporallyCooriginating
temporalBoundsContain
temporalBoundsIdentical
startsDuring
overlapsStart
startingPoint
simultaneousWith
after
Temporal Relations
“Ariel Sharon was in Jerusalem during 2005 with granularity calendar-week”
“Condoleezza Rice made a ten-day trip to Jerusalem in February of 2005”
Both of them were in Jerusalem during February 2005
Lessons Learned
• Rather than struggling to reason in natural language sentences, use logic as the representation language.
• Most knowledge is default; reason by argumentation
• Rather than striving in vain for a single fast inference engine, use a suite of 1000+ heuristic modules that each handles a class of commonly-occurring problems very fast. [EL HL split]
• Some of these HL modules act as tacticians (meta-reasoners) to guide the reasoning; a few are strategists (meta-meta-reasoners)
• Bridging the knowledge gap: do the “intermediate theories.”
• Probabilities / certainty factors are useful (risk: overdependence)
• Rather than striving in vain for a monolithic consistent KB, divide the KB up into many locally-consistent contexts
Each assertion should be situated in a context: in a region of context-space
• We identified 12 dimensions of mt-space
• We developed a vocabulary of predicates and terms to describe points and regions along each of those 12 dimensions; and
• We have been situating assertions more and more precisely, and we have been working out calculi for inferring contexts
– E.g., if P is true in C1, and P=>Q is true in C2, in what context C3 can Q be validly concluded?
The 12 dimensions:
• Anthropacity
• Time
• GeoLocation
• TypeOfPlace
• TypeOfTime
• Culture
• Sophistication/Security
• Topic
• Granularity
• Modality/Disposition/Epistemology
• Argument-Preference
• Justification
Mathematical Factoring of Context-space Dimensions
UnitedStatesIn1985Context: Ronald Reagan is president.
PennsylvaniaIn1985Context: Dick Thornburgh is governor.
LehighCountyInFebruary1985Context: Dick Thornburgh is governor and Ronald Reagan is president.
This inference depends on the time, space, and respective granularities of the contexts.
There are at least 900,000 doctors.
Dick Thornburgh is governor and there are at least 900,000 doctors.
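The president/governor inference can be sketched by factoring each context into a place and a time interval and asking which contexts cover which. This is a minimal sketch under stated assumptions: the `Context` class, the `PART_OF` table, and the month-level intervals are all hypothetical. It also lifts every fact into covered sub-contexts, which is only appropriate for facts that hold uniformly across a context; the granularity caveat on the slide is not modeled.

```python
from dataclasses import dataclass

# Hypothetical containment table for the places named on the slide.
PART_OF = {"LehighCounty": "Pennsylvania", "Pennsylvania": "UnitedStates"}

def within(place, region):
    """True if place equals region or lies (transitively) inside it."""
    while place != region and place in PART_OF:
        place = PART_OF[place]
    return place == region

@dataclass(frozen=True)
class Context:
    region: str
    start: tuple  # (year, month), inclusive
    end: tuple    # (year, month), inclusive

    def covers(self, other):
        """True if `other` is a sub-region and sub-interval of this context."""
        return (within(other.region, self.region)
                and self.start <= other.start and other.end <= self.end)

us_1985 = Context("UnitedStates", (1985, 1), (1985, 12))
pa_1985 = Context("Pennsylvania", (1985, 1), (1985, 12))
lehigh_feb = Context("LehighCounty", (1985, 2), (1985, 2))

facts = [(us_1985, "Ronald Reagan is president"),
         (pa_1985, "Dick Thornburgh is governor")]

def holds_in(ctx):
    """Facts inherited by ctx from every context that covers it."""
    return [text for c, text in facts if c.covers(ctx)]
```

Here `holds_in(lehigh_feb)` returns both facts, so their conjunction can be stated in the Lehigh County context, while `holds_in(us_1985)` returns only the fact about the president.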
Time Indices and Granularities
Therefore Cyc should infer (as a default):
Doug is talking, at 1400-1500, on 4 May 2009.
Doug is talking, at 14:42-14:47, on 4 May 2009.
But should remain noncommittal about:
Doug is talking, at 14:42:09, on 4 May 2009.
Time Indices and Granularities
P = Doug is talking.
Doug is talking, at 14:00 to 15:00, on 4 May 2009 with temporal granularity 1 calendar minute
[Figure: timeline from Past to Future, ticked in calendar minutes, marking the intervals t and t’]
t = that two-hour interval
t’ = a continuous 15-min. sub-interval
So: Talking during each 15-minute interval? Yes
Talking during each 2-second interval: Unknown
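One way to read the granularity is that P holds at some point within every calendar-aligned minute of the asserted interval. Under that reading, which is an assumption of this sketch and not a statement of Cyc's temporal calculus, a query interval gets "Yes" exactly when it contains at least one whole aligned unit. The function name and the seconds-since-midnight encoding are also illustrative.

```python
def holds_somewhere(query_start, query_end, start, end, granularity):
    """Decide whether P must hold at some point in [query_start, query_end).

    All times are seconds since midnight. P is asserted over [start, end)
    with the given granularity (unit length in seconds), read here as:
    P holds at some point in every calendar-aligned unit of that interval.
    Returns "Yes" if the query interval must contain P, else "Unknown".
    """
    if query_start < start or query_end > end:
        return "Unknown"
    # First calendar-aligned unit boundary at or after query_start (ceiling division).
    first_boundary = -(-query_start // granularity) * granularity
    # "Yes" only if a whole aligned unit fits inside the query interval.
    return "Yes" if first_boundary + granularity <= query_end else "Unknown"

def hms(h, m, s=0):
    """Convert a clock time to seconds since midnight."""
    return h * 3600 + m * 60 + s

START, END, MINUTE = hms(14, 0), hms(15, 0), 60

five_minutes = holds_somewhere(hms(14, 42), hms(14, 47), START, END, MINUTE)
two_seconds = holds_somewhere(hms(14, 42, 9), hms(14, 42, 11), START, END, MINUTE)
```

The five-minute query yields "Yes" and the two-second query yields "Unknown", matching the two cases on the slide.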
Relations Between an Event and its Participants
performedBy, causes-EventEvent, objectPlaced, objectOfStateChange, outputsCreated, inputsDestroyed, assistingAgent, beneficiary, fromLocation, toLocation, deviceUsed, driverActor, damages, vehicle, providerOfMotiveForce, transportees
Over 400 more.
“In” in Our Geospatial Ontology
• We started in 1984 with just one binary predicate, “in”.
• in(X,Y) means the inner object X is spatially located in the region defined by the outer object Y.
• If I just tell you in(X,Y), and you aren’t told what X and Y are, then you (and Cyc) can’t answer questions like these:
– From the outside of Y, can I see any part of X?
– If I turn Y over and shake it, will X fall out?
– Is there room to put more things in Y?
– Is X actually a part of Y?
• Such failures led to our introducing new, more precise, more specialized versions of “in”. By now there are over 75 such predicates, organized in a graphical taxonomy.
Propositional Attitudes
Relations Between Agents and Propositions
• goals
• intends
• desires
• hopes
• expects
• believes
• opinesThat
• knowsThat
• remembersThat
• perceivesThat
• seesThat
• fearsThat
Most of these are modal; assertions using them go beyond 1st-order logic
Handcrafted Cyc KB
[Figure: the same map of Cyc KB topic areas as on the earlier “The Cyc Knowledge Base” slide, from Thing at the top down through the domain areas. Layer labels here: Real World Domain Knowledge; Specific cases, facts, details,…]
Cyc contains:
15,000 Predicates
500,000 Concepts
5,200,000 Assertions
Represented in:
• First Order Logic
• Higher Order Logic
• Context Logic
• Microtheories
The pump has been primed: use it as an inductive bias to power more automatic knowledge acquisition.
Automated Knowledge Acquisition
AKA by Shallow Fishing
(foundingDate AbuSayyaf ?X)
Search Strings
• Abu Sayyaf was founded in ___
• Al Harakat Islamiya, established in ___
• ASG was established in ___
Abu Sayyaf was founded in the early 1990s
Parse
(foundingDate AbuSayyaf (EarlyPartFn (DecadeFn 199)))
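The fishing loop for the Abu Sayyaf example can be sketched in Python: turn an open query into search strings, then parse whatever text fills the blank back into a term. The helper names, the phrasing templates, and the single regular expression for "the early NNN0s" are assumptions made for this sketch; the real parser is certainly more general.

```python
import re

def search_strings(aliases, templates):
    """Fill each phrasing template with each known name for the entity."""
    return [t.format(name=a) for a in aliases for t in templates]

def parse_early_decade(snippet, prefix):
    """Find 'the early NNN0s' right after the prefix and build a CycL-style term.

    Returns None when the snippet does not match this one hypothetical pattern.
    """
    m = re.search(re.escape(prefix) + r"\s+the early (\d{3})0s", snippet)
    if not m:
        return None
    return f"(EarlyPartFn (DecadeFn {m.group(1)}))"

# Search strings generated from the open query (foundingDate AbuSayyaf ?X).
strings = search_strings(["Abu Sayyaf", "ASG"],
                         ["{name} was founded in", "{name} was established in"])

# A sentence retrieved by one of those strings, parsed back into an assertion.
term = parse_early_decade("Abu Sayyaf was founded in the early 1990s",
                          "Abu Sayyaf was founded in")
assertion = f"(foundingDate AbuSayyaf {term})"
```

The existing KB does the priming here: it supplies the aliases, the predicate to be filled, and the type of value to expect.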
Automated Knowledge Acquisition
AKA by Shallow Fishing
(height EiffelTower ?x)
Search Strings
• The height of the Eiffel Tower is ___
• The Eiffel Tower is ___ tall
The height of the Eiffel Tower is 36 feet
The height of the Eiffel Tower is 984 feet
Parse
(height EiffelTower (Foot 36))
(height EiffelTower (Foot 984))
WWW.CYC.COM
Recent/Future AKB Directions
• Make it comprehensive (13% → 100%); apply it to other domains
• Make it easier for SMEs to enter/vet/modify info.
• Improve the automatic acquisition (parsing / fishing from unstructured texts; SKSI to structured sources, incl. SPARQL)
• Make it easier for end users to pose questions:
– Automatically select (a small superset of) the relevant fragments
– Use semantic constraints (argIsa, disjointness, domain knowledge…) to combine the relevant fragments into a meaningful logical query
• Make justifications more terse and more compelling
• Speed up inference (in general; and for AKB entry and AKB query-answering)
• Graceful degradation [halfway between QA & Google], falling back on Semantic Search of automatically tagged documents (tagged with Cyc terms)
Developing a Cyc App.
• Extend Cyc’s KB
– Augment its ontology
– New assertions involving those new terms
• New Heuristic Level modules
– Identify the need(s) for them
– Design, build, and debug them
• New interface modules
– For manual entry; for SKSI mapping; for end users
– Domain-specific interfaces (e.g., sketching military unit movements; drawing chemical formulae; etc.)
OpenCyc: Open Source release of [most of] the Cyc Ontology + Simple Relns. + Inference Engine
ResearchCyc: Almost all of Cyc (free for R&D purposes)
The Ontology
Pre-existing general medical knowledge framework
Prior to the CCF project, Cyc’s KB had 184 specializations of MedicalCareEvent:
MedicalCareEvent
Ablation
Ligation CoronaryArteryBypassGraft Biopsy-SurgicalProcedure TrephiningSomeone Prostatectomy
RoboticSurgery OutpatientSurgery InpatientSurgery LiposuctionSurgery RemovalOfUniqueBodyPart Appendectomy
…
Tonsillectomy
GumSurgery
SurgicalTreatment TransplantSurgery HeartTransplantSurgery GeneralSurgery
MajorSurgery
OpenHeartSurgery RootCanalSurgery VaccinationEvent BoosterVaccinationEvent AnthraxMilitaryVaccinationScript
MedicalTesting
…
The Ontology
Pre-existing general medical knowledge framework
Prior to the CCF project, Cyc’s KB had 350+ specializations of AilmentCondition:
AttentionDeficitDisorder Glaucoma SpinalStenosis SleepDeprivation Ache-AilmentCondition Migraine Hemorrhaging-TheCondition Jaundice ParasiticAilment BacillaryAngiomatosis Cryptosporidiosis Rickettsiosis EpidemicTyphus-NAmerica ArthropodInfestation ExternalArthropodInfestation InternalArthropodInfestation Trichinosis Schistosomiasis Ascariasis BladderFlukeInfestation
…
Atherosclerosis MultiplePersonalityDisorder Adenomyosis Scabies AmyotrophicLateralSclerosis Scoliosis Hypoglycemia TemproMandibularJointSyndrome AcetylcholinePoisoning CadmiumPoisoning CarbonMonoxidePoisoning FoodborneBotulism InhalationalBotulism WoundBotulism InfantBotulism Endometriosis Neuralgia Sciatica Diverticulitis Gout MacularDegeneration
…
The Ontology
Pre-existing general medical knowledge framework
Prior to the CCF project, Cyc’s KB had 200+ specializations of Bacterium:
StreptococcusPneumoniae StreptococcusPyogenes
Bacillaceae-Family
Bacillus-Genus
BacillusCereus-Species
Monotrichous
Bacterium-Monotrichous
Peritrichous
Bacterium-Peritrichous
Amphitrichous
Bacterium-Amphitrichous
Tenericutes-Division
Mollicutes-Class
Anaeroplasmataceae-Family
…
Asteroplasma-Genus
Acholeplasmatales-Order Acholeplasmataceae-Family Acholeplasma-Genus
Phytoplasma-Genus
Eperythrozoon-Genus
Mycoplasmatales-Order Mycoplasmataceae-Family
Mycoplasma-Genus MycoplasmaPneumoniae-Species Spirillales-Order
Vibrionaceae-Family
Vibrio-Genus
VibrioCholerae-Species
…
The Ontology
Hundreds of pre-existing relevant relationships
General Role Predicates:
objectActedOn
eventOccursAt
dateOfEvent
objectPlaced
objectRemoved
deviceUsed
…
Medical domain specific relations:
infectionCausedByOrganism
infectingPathogen
patientTreated
deviceTypeTreatsConditionType
causeOfDeathTypeOfType
formOfDisease
ailmentTypeAffects
ailmentEpidemicType
ailmentAcquiredBy
ailmentTypicallyAcquiredBy
indicatedDrug
mortalityRiskForCondition
survivalRate
riskOfInfectionFromTypeToType
…
The Ontology
Methodology
• Establish bridging (translation) rules
• Define rules that allow users to associate patients, dates, locations, etc. with the various events, e.g. define patientTreated as a relationship between a medical event and a patient.
• Define rules that allow users to easily express complicated logical conditions, e.g. the defining rules for PrimarySurgery, isolatedProcedureOfType, concomitantProcedures, etc.
• Define concise vocabulary for constructions that are complicated or difficult to express, e.g. “aortic valve replacement” is represented as a single non-atomic term. This allows the user to specify this very common procedure with a single fragment instead of the three distinct fragments it requires in the CCF ontology (which arose because the CCF representation has no explicit construct for composing functional terms).
Typical Query for outcomes study
The examples in this presentation were short, simple, “Medical English” queries; the ones being focused on while building the application, and now that it is actually being used at CCF, are much larger ones, e.g.:
IDENTIFY PATIENT POPULATION:
• FIND all native aortic valve replacements performed at CCF between January 1, 2000 and December 31, 2004 with a pre-operative diagnosis, as determined by echocardiogram, of moderately severe or severe aortic stenosis and moderate to severe left ventricular impairment.
• INCLUDE operations in which concomitant primary CABG or concomitant mitral or tricuspid valve repair was performed.
• EXCLUDE all patients with any prior valve repair or replacement; or with concomitant pulmonary valve repair; or with concomitant mitral, tricuspid, or pulmonary valve replacement; or with aortic regurgitation greater than moderate degree.
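To make the logical structure of this query concrete, here is a rough sketch that treats it as a filter over flattened operation records. Everything about the record layout (the `Operation` class, its field names, the string labels, the ordinal regurgitation grade) is hypothetical and is not the CCF or Cyc representation. The sketch also assumes that any concomitant procedure not named in the EXCLUDE clause is acceptable, which the query text does not state outright.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Operation:
    """Hypothetical flattened record; field names are illustrative only."""
    procedure: str
    performed_on: date
    aortic_stenosis: str        # "moderate", "moderately severe", "severe", ...
    lv_impairment: str          # "mild", "moderate", "severe", ...
    aortic_regurgitation: int   # assumed ordinal grade: 2 means "moderate"
    concomitant: set = field(default_factory=set)
    prior_valve_surgery: bool = False


# Concomitant procedures named in the EXCLUDE clause.
EXCLUDED_CONCOMITANT = {
    "pulmonary valve repair",
    "mitral valve replacement",
    "tricuspid valve replacement",
    "pulmonary valve replacement",
}
MODERATE = 2


def in_population(op):
    """Apply the FIND clause, then drop anything the EXCLUDE clause rules out."""
    found = (
        op.procedure == "native aortic valve replacement"
        and date(2000, 1, 1) <= op.performed_on <= date(2004, 12, 31)
        and op.aortic_stenosis in {"moderately severe", "severe"}
        and op.lv_impairment in {"moderate", "moderately severe", "severe"}
    )
    excluded = (
        op.prior_valve_surgery
        or bool(op.concomitant & EXCLUDED_CONCOMITANT)
        or op.aortic_regurgitation > MODERATE
    )
    return found and not excluded
```

Under these assumptions the INCLUDE clause needs no code of its own: concomitant primary CABG and mitral or tricuspid repair pass because they are not in the excluded set.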
Researchers and clinicians sometimes ask the same queries
“Are there cases in the last decade where patients had pericardial aortic valves inserted in the reverse position, to serve as mitral valve replacements, and how often in such cases did endocarditis or tricuspid valve infection develop, and how long after the procedure?”
• Get a large set of use-cases (CCF task: the last 900 queries)
• Arrange them into maximally mutually-dissimilar classes
• Manually represent a couple from each of those buckets
– Reveals most of the necessary new predicates (+ interfaces)
• Now go through each of the use-cases, trolling for new domain-specific terms to add to the ontology
– Can be done manually, but we are beginning to rely more on semi-automatic methods where the system itself helps with that process
– As appropriate, lexify the terms and/or align them to existing standards
• Run exemplars from each bucket (i.e., to completion)
– Tracer bullets to reveal necessary new rules, reasoning modules (+ interfaces)
• Replace the largest bucket with 2-4 specializations and recur (i.e., repeat the preceding 3 steps, and this one, again) until there is no new gain
• Test the system on previously-unseen use-cases (or at least ones which were not among those previously-selected from their bucket)
• Have users try to use the system, and watch them (their results, of course, but also to the extent possible their time-feature trajectory)
– Which features did they rarely or never use (to good effect)?
– Which features did they make heavy use of?
– Independent of this, ask them for their feedback and suggestions
– Try to identify classes of users which will translate into classes of documentation and training materials/regimes/interface specifics
• All along, identify what elements of the ontology (if any) are proprietary, and assimilate everything else into future versions of OpenCyc and ResearchCyc
(implies
  (and
    (cCFhasLeftAtriumDiameter ?EVT ?D)
    (greaterThan ?D ((Centi Meter) 3.8))
    (patientTreated ?EVT ?PAT)
    (patientSex ?PAT FemaleHuman)
    (rdf-type ?EVT ?TYPE)
    (genls ?TYPE CCF-Evaluation))
  (isa ?EVT EvaluationThatIndicates-LeftAtrialEnlargement))
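Read procedurally, the rule above says: an evaluation event with a left atrium diameter over 3.8 cm, on a female patient, whose type is a kind of CCF-Evaluation, indicates left atrial enlargement. A minimal sketch of checking that antecedent over a toy fact set follows. The individuals (`Eval001`, `Patient001`), the type `CCF-EchoEvaluation`, and the helper names are invented for the example, and `genls` is only followed one step here rather than transitively.

```python
# Facts as (predicate, arg1, arg2) tuples; all individuals are made up.
FACTS = {
    ("cCFhasLeftAtriumDiameter", "Eval001", 4.1),   # centimeters
    ("patientTreated", "Eval001", "Patient001"),
    ("patientSex", "Patient001", "FemaleHuman"),
    ("rdf-type", "Eval001", "CCF-EchoEvaluation"),  # hypothetical type
    ("genls", "CCF-EchoEvaluation", "CCF-Evaluation"),
}


def lookup(facts, predicate, arg1):
    """All second arguments v such that (predicate arg1 v) is a fact."""
    return [f[2] for f in facts if f[0] == predicate and f[1] == arg1]


def indicates_left_atrial_enlargement(facts, event, threshold_cm=3.8):
    """Check the rule's antecedent for one event."""
    big_enough = any(
        d > threshold_cm
        for d in lookup(facts, "cCFhasLeftAtriumDiameter", event)
    )
    female = any(
        "FemaleHuman" in lookup(facts, "patientSex", patient)
        for patient in lookup(facts, "patientTreated", event)
    )
    evaluation = any(
        "CCF-Evaluation" in lookup(facts, "genls", event_type)
        for event_type in lookup(facts, "rdf-type", event)
    )
    return big_enough and female and evaluation


print(indicates_left_atrial_enlargement(FACTS, "Eval001"))
```

An inference engine would bind the variables by unification instead of hand-written lookups; the point of the sketch is only to show which facts must hold together before the conclusion is drawn.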
1784 pieces of pre-existing (prior to this project) Cyc KB knowledge used while handling a typical query. E.g.:
Inferred Disjointness constraints:
(disjointWith PericardialWindow-SurgicalProcedure MedicalPatient)
Justification: [we are “counting” each of these assertions, in the total:]
(genls PericardialWindow-SurgicalProcedure PericardialProcedure-Surgical) in UniversalVocabularyMt
(genls PericardialProcedure-Surgical CardiacProcedure-Surgical) in UniversalVocabularyMt
(genls CardiacProcedure-Surgical SurgicalProcedure) in UniversalVocabularyMt
(genls SurgicalProcedure MedicalCareEvent) in BaseKB
(genls MedicalCareEvent PhysicalSituation) in BaseKB
(genls PhysicalSituation Situation-Localized) in UniversalVocabularyMt
(genls Situation-Localized Situation) in UniversalVocabularyMt
(disjointWith SpatialThing-NonSituational Situation) in BaseKB
(genls EnduringThing-Localized SpatialThing-NonSituational) in UniversalVocabularyMt
(genls Agent-NonGeographical EnduringThing-Localized) in UniversalVocabularyMt
(genls EmbodiedAgent Agent-NonGeographical) in UniversalVocabularyMt
(genls PerceptualAgent-Embodied EmbodiedAgent) in UniversalVocabularyMt
(genls Animal PerceptualAgent-Embodied) in UniversalVocabularyMt
(genls MedicalPatient Animal) in UniversalVocabularyMt
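The justification above amounts to walking two genls chains upward until they reach a pair of collections that are asserted to be disjoint. A small sketch of that inference, using exactly the fourteen assertions listed, is given below. Microtheories are ignored, and the function names are illustrative rather than Cyc's own.

```python
# genls links from the justification: collection -> direct generalizations.
GENLS = {
    "PericardialWindow-SurgicalProcedure": ["PericardialProcedure-Surgical"],
    "PericardialProcedure-Surgical": ["CardiacProcedure-Surgical"],
    "CardiacProcedure-Surgical": ["SurgicalProcedure"],
    "SurgicalProcedure": ["MedicalCareEvent"],
    "MedicalCareEvent": ["PhysicalSituation"],
    "PhysicalSituation": ["Situation-Localized"],
    "Situation-Localized": ["Situation"],
    "MedicalPatient": ["Animal"],
    "Animal": ["PerceptualAgent-Embodied"],
    "PerceptualAgent-Embodied": ["EmbodiedAgent"],
    "EmbodiedAgent": ["Agent-NonGeographical"],
    "Agent-NonGeographical": ["EnduringThing-Localized"],
    "EnduringThing-Localized": ["SpatialThing-NonSituational"],
}

# The single explicitly asserted disjointness fact.
DISJOINT = {frozenset(("SpatialThing-NonSituational", "Situation"))}


def ancestors(collection):
    """The collection itself plus everything reachable through genls."""
    seen, stack = set(), [collection]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(GENLS.get(current, []))
    return seen


def inferred_disjoint(a, b):
    """True if some generalization of a is asserted disjoint with one of b."""
    return any(
        frozenset((x, y)) in DISJOINT
        for x in ancestors(a)
        for y in ancestors(b)
    )


print(inferred_disjoint("PericardialWindow-SurgicalProcedure", "MedicalPatient"))
```

This is why a single high-level `disjointWith` assertion can rule out a great many nonsensical combinations further down the hierarchy, such as a surgical procedure that is also a patient.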
Ideas for NLM Grand Challenges
• Comprehensive Ontology of Medicine
– Ties to terminological standards (SNOMED, ICD…), lexical ones (WordNet), conceptual ones (Cyc)
– Knowledge about/involving the concepts
  • Contextualized for time, source, level of detail,…
  • Sample sub-project: multicultural English-to-English translation
• English-to-English “translation”
– Using the above ontology of medicine, and models of discourse, models of classes of users (by age, occupation, etc.), models of individual users (built up over time and stored HIPAA-securely)
– Translate articles, web pages, medicine bottle labels, etc. into comprehensible form for that user
  • In some cases this means literally writing more text, expanding its length, or paring it down (eliminating what the user already knows)
  • In less clear cases (where the user might or might not already know some piece of information), the best way to expand the original text might be to add footnotes containing the borderline information, and to pare down the original text by relegating borderline material to footnote form
– The translations needn’t just be static; they can sync with the user’s calendars, cell phones, computers, etc., to provide reminders, proactively send them relevant news articles or new warnings, and so on
• Automated Clinical/Biomedical Discovery
– Hypothesis formation, Experiment design, Data gathering, Analysis, New terms & hypotheses