The DARPA High-Performance Knowledge Bases Project

Paul Cohen, Robert Schrag, Eric Jones, Adam Pease, Albert Lin, Barbara Starr, David Gunning, and Murray Burke

AI Magazine Volume 19, Number 4 (1998). Copyright 1998, American Association for Artificial Intelligence. All rights reserved.



Now completing its first year, the High-Performance Knowledge Bases Project promotes technology for developing very large, flexible, and reusable knowledge bases. The project is supported by the Defense Advanced Research Projects Agency and includes more than 15 contractors in universities, research laboratories, and companies. The evaluation of the constituent technologies centers on two challenge problems, in crisis management and battlespace reasoning, each demanding powerful problem solving with very large knowledge bases. This article discusses the challenge problems, the constituent technologies, and their integration and evaluation.

Although a computer has beaten the world chess champion, no computer has the common sense of a six-year-old child. Programs lack knowledge about the world sufficient to understand and adjust to new situations as people do. Consequently, programs have been poor at interpreting and reasoning about novel and changing events, such as international crises and battlefield situations. These problems are more open ended than chess. Their solution requires shallow knowledge about motives, goals, people, countries, adversarial situations, and so on, as well as deeper knowledge about specific political regimes, economies, geographies, and armies.

The High-Performance Knowledge Base (HPKB) Project is sponsored by the Defense Advanced Research Projects Agency (DARPA) to develop new technology for knowledge-based systems.1 It is a three-year program, ending in fiscal year 1999, with funding totaling $34 million. HPKB technology will enable developers to rapidly build very large knowledge bases (on the order of 10^6 rules, axioms, or frames), enabling a new level of intelligence for military systems. These knowledge bases should be comprehensive and reusable across

many applications, and they should be maintained and modified easily. Clearly, these goals require innovation in many areas, from knowledge representation to formal reasoning and special-purpose problem solving, from knowledge acquisition to information gathering on the web to machine learning, from natural language understanding to semantic integration of disparate knowledge bases.

For roughly one year, HPKB researchers have been developing knowledge bases containing tens of thousands of axioms concerning crises and battlefield situations. Recently, the technology was tested in a month-long evaluation involving sets of open-ended test items, most of which were similar to sample (training) items but otherwise novel. Changes to the crisis and battlefield scenarios were introduced during the evaluation to test the comprehensiveness and flexibility of knowledge in the HPKB systems. The requirement for comprehensive, flexible knowledge about general scenarios forces knowledge bases to be large. Challenge problems, which define the scenarios and thus drive knowledge base development, are a central innovation of HPKB. This article discusses HPKB challenge problems, technologies and integrated systems, and the evaluation of these systems.

The challenge problems require significant developments in three broad areas of knowledge-based technology. First, the overriding goal of HPKB (to be able to select, compose, extend, specialize, and modify components from a library of reusable ontologies, common domain theories, and generic problem-solving strategies) is not immediately achievable and requires some research into foundations of very large knowledge bases, particularly research in knowledge representation and ontological engineering. Second, there is the



problem of building on these foundations to populate very large knowledge bases. The goal is for collaborating teams of domain experts (who might lack training in computer science) to easily extend the foundation theories, define additional domain theories and problem-solving strategies, and acquire domain facts. Third, because knowledge is not enough, one also requires efficient problem-solving methods. HPKB supports research on efficient, general inference methods and optimized task-specific methods.

HPKB is a timely impetus for knowledge-based technology, although some might think it overdue. Some of the tenets of HPKB were voiced in 1987 by Doug Lenat and Ed Feigenbaum (Lenat and Feigenbaum 1987), and some have been around for longer. Lenat's CYC Project has also contributed much to our understanding of large knowledge bases and ontologies. Now, 13 years into the CYC Project and more than a decade after Lenat and Feigenbaum's paper, there seems to be consensus on the following points:

The first and most intellectually taxing task when building a large knowledge base is to design an ontology. If you get it wrong, you can expect ongoing trouble organizing the knowledge you acquire in a natural way. Whenever two or more systems are built for related tasks (for example, medical expert systems, planning, modeling of physical processes, scheduling and logistics, natural language understanding), the architects of the systems realize, often too late, that someone else has already done, or is in the process of doing, the hard ontological work. HPKB challenges the research community to share, merge, and collectively develop large ontologies for significant military problems. However, an ontology alone is not sufficient. Axioms are required to give meaning to the terms in an ontology. Without them, users of the ontology can interpret the terms differently.

Most knowledge-based systems have no common sense, so they cannot be trusted. Suppose you have a knowledge-based system for scheduling resources such as heavy-lift helicopters, and none of its knowledge concerns noncombatant evacuation operations. Now, suppose you have to evacuate a lot of people. Lacking common sense, your system is literally useless. With a little common sense, it could not only support human planning but might be superior to it because it could think outside the box and consider using the helicopters in an unconventional way. Common sense is needed to recognize and exploit opportunities as well as avoid foolish mistakes.

Often, one will accept an answer that is roughly correct, especially when the alternatives are no answer at all or a very specific but wrong answer. This is Lenat and Feigenbaum's breadth hypothesis: "Intelligent performance often requires the problem solver to fall back on increasingly general knowledge, and/or to analogize to specific knowledge from far-flung domains" (Lenat and Feigenbaum 1987, p. 1173). We must, therefore, augment high-power knowledge-based systems, which give specific and precise answers, with weaker but adequate knowledge and inference. The inference methods might not all be sound and complete. Indeed, one might need a multitude of methods to implement what Polya called plausible inference. HPKB encompasses work on a variety of logical, probabilistic, and other inference methods.

It is one thing to recognize the need for commonsense knowledge, another to integrate it seamlessly into knowledge-based systems. Lenat observes that ontologies often are missing a middle level, the purpose of which is to connect very general ontological concepts such as human and activity with domain-specific concepts such as the person who is responsible for navigating a B-52 bomber. Because HPKB is grounded in domain-specific tasks, the focus of much ontological engineering is this middle layer.

The Participants

The HPKB participants are organized into three groups: (1) technology developers, (2) integration teams, and (3) challenge problem developers. Roughly speaking, the integration teams build systems with the new technologies to solve challenge problems. The integration teams are led by SAIC and Teknowledge. Each integration team fields systems to solve challenge problems in an annual evaluation. University participants include Stanford University, Massachusetts Institute of Technology (MIT), Carnegie Mellon University (CMU), Northwestern University, University of Massachusetts (UMass), George Mason University (GMU), and the University of Edinburgh (AIAI). In addition, SRI International, the University of Southern California Information Sciences Institute (USC-ISI), the Kestrel Institute, and TextWise, Inc., have developed important components. Information Extraction and Transport (IET), Inc., with Pacific Sierra Research (PSR), Inc., developed and evaluated the crisis-management challenge problem, and Alphatech, Inc., is responsible for the battlespace challenge problem.



    Challenge Problems

A programmatic innovation of HPKB is challenge problems. The crisis-management challenge problem, developed by IET and PSR, is designed to exercise broad, relatively shallow knowledge about international tensions. The battlespace challenge problem, developed by Alphatech, Inc., has two parts, each designed to exercise relatively specific knowledge about activities in armed conflicts. Movement analysis involves interpreting vehicle movements detected and tracked by idealized sensors. The workaround problem is concerned with finding military engineering solutions to traffic-obstruction problems, such as destroyed bridges and blocked tunnels.

Good challenge problems must satisfy several, often conflicting, criteria. A challenge problem must be challenging: It must raise the bar for both technology and science. A problem that requires only technical ingenuity will not hold the attention of the technology developers, nor will it help the United States maintain its preeminence in science. Equally important, a challenge problem for a DARPA program must have clear significance to the United States Department of Defense (DoD). Challenge problems should serve for the duration of the program, becoming more challenging each year. This continuity is preferable to designing new problems every year because the infrastructure to support challenge problems is expensive.

A challenge problem should require little or no access to military subject-matter experts. It should not introduce a knowledge-acquisition bottleneck that results in delays and low productivity from the technology developers. As much as possible, the problem should be solvable with accessible, open-source material. A challenge problem should exercise all (or most) of the contributions of the technology developers, and it should exercise an integration of these technologies. A challenge problem should have unambiguous criteria for evaluating its solutions. These criteria need not be so objective that one can write algorithms to score performance (for example, human judgment might be needed to assess scores), but they must be clear, and they must be published early in the program. In addition, although performance is important, challenge problems that value performance above all else encourage one-off solutions (a solution developed for a specific problem, once only) and discourage researchers from trying to understand why their technologies work well or poorly. A challenge problem should provide a steady stream of results, so progress can be assessed not only by technology developers but also by DARPA management and involved members of the DoD community.

The HPKB challenge problems are designed to support new and ongoing DARPA initiatives in intelligence analysis and battlespace information systems. Crisis-management systems will assist strategic analysts by evaluating the political, economic, and military courses of action available to nations engaged at various levels of conflict. Battlespace systems will support operations officers and intelligence analysts by inferring militarily significant targets and sites, reasoning about road network trafficability, and anticipating responses to military strikes.

Crisis-Management Challenge Problem

The crisis-management challenge problem is intended to drive the development of broad, relatively shallow commonsense knowledge bases to facilitate intelligence analysis. The client program at DARPA for this problem is Project GENOA (Collaborative Crisis Understanding and Management). GENOA is intended to help analysts more rapidly understand emerging international crises to preserve U.S. policy options. Proactive crisis management, before a situation has evolved into a crisis that might engage the U.S. military, enables more effective responses than reactive management. Crisis-management systems will assist strategic analysts by evaluating the political, economic, and military courses of action available to nations engaged at various levels of conflict.

The challenge problem development team worked with GENOA representatives to identify areas for the application of HPKB technology. This work took three or four months, but the crisis-management challenge problem specification has remained fairly stable since its initial release in draft form in July 1997.

The first step in creating the challenge problem was to develop a scenario to provide context for intelligence analysis in time of crisis. To ensure that the problem requires development of real knowledge about the world, the scenario includes real national actors with a fictional yet plausible story line. The scenario, which takes place in the Persian Gulf, involves hostilities between Saudi Arabia and Iran that culminate in closing the Strait of Hormuz to international shipping.

Next, IET worked with experts at PSR to develop a description of the intelligence analysis process, which involves the following tasks: information gathering (what happened?), situation assessment (what does it mean?), and



scenario development (what might happen next?). Situation assessment (or interpretation) includes factors that pertain to the specific situation at hand, such as motives, intents, risks, rewards, and ramifications, and factors that make up a general context, or "strategic culture," for a state actor's behavior in international relations, such as capabilities, interests, policies, ideologies, alliances, and enmities. Scenario development, or speculative prediction, starts with the generation of plausible actions for each actor. Then, options are evaluated with respect to the same factors used for situation assessment, and a likelihood rating is produced. The most plausible actions are reported back to policy makers.

These analytic tasks afford many opportunities for knowledge-based systems. One is to use knowledge bases to retain or multiply corporate expertise; another is to use knowledge and reasoning to think outside the box, to generate analytic possibilities that a human analyst might overlook. The latter task requires extensive commonsense knowledge, or "analyst's sense," about the domain to rule out implausible options.

The crisis-management challenge problem includes an informal specification for a prototype crisis-management assistant to support analysts. The assistant is tested by asking questions. Some are simple requests for factual information; others require the assistant to interpret the actions of nations in the context of strategic culture. Actions are motivated by interests, balancing risks and rewards. They have impacts and require capabilities. Interests drive the formation of alliances, the exercise of influence, and the generation of tensions among actors. These factors play out in a current and a historical context. Crises can be represented as events or as larger episodes tracking the evolution of a conflict over time, from inception or trigger, through any escalation, to eventual resolution or stasis. The representations being developed in HPKB are intended to serve as a crisis corporate memory to help analysts discover historical precedents and analogies for actions. Much of the challenge-problem specification is devoted to sample questions that are intended to drive the development of general models for reasoning about crisis events.

Sample questions are embedded in an analytic context. The question "What might happen next?" is instantiated as "What might happen following the Saudi air strikes?" as shown in figure 1. Q51 is refined to Q83 in a way that is characteristic of the analytic process; that is, higher-level questions are refined into sets of lower-level questions that provide detail.

The challenge-problem developers (IET with PSR) developed an answer key for sample questions, a fragment of which is shown in figure 2. Although simple factual questions (for example, What is the gross national product of the United States?) have just one answer, questions such as Q53 usually have several. The answer key actually lists five answers, two of which are shown in figure 2. Each is accompanied by suitable explanations, including source material. The first source (Convention on the Law of the Sea) is electronic. IET maintains a web site with links to pages that are expected to be useful in answering the questions. The second source is a fragment of a model developed by IET and published in the challenge-problem specification. IET developed these fragments to indicate the kinds of reasoning they would be testing in the challenge problem.

Figure 1. Sample Questions Pertaining to the Responses to an Event.

III. What of significance might happen following the Saudi air strikes?
  B. Options evaluation: Evaluate the options available to Iran.
     Close the Strait of Hormuz to shipping.
     Evaluation: Probable
     Motivation: Respond to Saudi air strikes and deter future strikes.
     Capability:
       (Q51) Can Iran close the Strait of Hormuz to international shipping?
       (Q83) Is Iran capable of firing upon tankers in the Strait of Hormuz? With what weapons?
     Negative outcomes:
       (Q53) What risks would Iran face in closing the strait?



For the challenge-problem evaluation, held in June 1998, IET developed a way to generate test questions through parameterization. Test questions deviate from sample questions in specified, controlled ways, so the teams participating in the challenge problem know the space of questions from which test items will be selected. This space includes billions of questions, so the challenge problem cannot be solved by relying on question-specific knowledge. The teams must rely on general knowledge to perform well in the evaluation. (Semantics provide practical constraints on the number of reasonable instantiations of parameterized questions, as do online sources provided by IET.) To illustrate, Q53 is parameterized in figure 3. Parameterized question 53, PQ53, actually covers 8 of the roughly 100 sample questions in the specification.

Parameterized questions and associated class definitions are based on natural language, giving the integration teams responsibility for developing (potentially different) formal representations of the questions. This decision was made at the request of the teams. An instance of a parameterized question, say, PQ53, is mechanically generated; then the teams must create a formal representation and reason with it, without human intervention.
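The mechanics of instantiation are straightforward to picture. The sketch below is an illustrative stand-in, not IET's generator: the template fields and fillers are invented, and it simply enumerates a parameter grid with string substitution.

```python
from itertools import product

# Hypothetical, simplified rendering of a parameterized question.
# The real PQ53 parameter classes come from IET's specification;
# the fillers below are illustrative only.
TEMPLATE = "What {aspect} would {actor} face in {action}?"

PARAMETERS = {
    "aspect": ["risks", "rewards"],
    "actor": ["Iran", "Saudi Arabia"],
    "action": ["closing the Strait of Hormuz",
               "attacking tankers in the strait"],
}

def instantiate(template: str, parameters: dict) -> list[str]:
    """Generate every test question in the space defined by the grid."""
    names = list(parameters)
    questions = []
    for fillers in product(*(parameters[n] for n in names)):
        questions.append(template.format(**dict(zip(names, fillers))))
    return questions

for q in instantiate(TEMPLATE, PARAMETERS):
    print(q)
```

With realistic parameter classes (actors, actions, event stages, and so on), grids like this multiply into the billions of questions mentioned above.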

    Battlespace Challenge Problems

The second challenge-problem domain for HPKB is battlespace reasoning. Battlespace is an abstract notion that includes not only the physical geography of a conflict but also the plans, goals, and activities of all combatants prior to, and during, a battle and during the activities leading to the battle. Three battlespace programs within DARPA were identified as potential users of HPKB technologies: (1) the dynamic multiinformation fusion program, (2) the dynamic database program, and (3) the joint forces air-component commander (JFACC) program. Two battlespace challenge problems have been developed.

Figure 2. Part of the Answer Key for Question 53.

Answer(s):
1. Economic sanctions from {Saudi Arabia, GCC, U.S., U.N.}
   The closure of the Strait of Hormuz would violate an international norm promoting freedom of the seas and would jeopardize the interests of many states. In response, states might act unilaterally or jointly to impose economic sanctions on Iran to compel it to reopen the strait. The United Nations Security Council might authorize economic sanctions against Iran.
2. Limited military response from {Saudi Arabia, GCC, U.S., others}

Source(s):
The Convention on the Law of the Sea.
(B5) States may act unilaterally or collectively to isolate and/or punish a group or state that violates international norms. Unilateral and collective action can involve a wide range of mechanisms, such as intelligence collection, military retaliation, economic sanction, and diplomatic censure/isolation.

Figure 3. A Parameterized Question Suitable for Generating Sample Questions and Test Questions.

PQ53: [During/After <...>,] what {risks, rewards} would <...> face in <...>?
<...> = {[exposure of its] {supporting, sponsoring} <...> in <...>, successful terrorist attacks against <...>'s <...>, (PQ51), taking hostage citizens of <...>, attacking <...> targets with <...>}
<...> = {terrorist group, dissident group, political party, humanitarian organization}
(The angle-bracketed parameter names of PQ53 did not survive extraction; slots are shown as <...>.)


The Movement-Analysis Challenge Problem

The movement-analysis challenge problem concerns high-level analysis of idealized sensor data, particularly the airborne JSTARS moving target indicator radar. This Doppler radar can generate vast quantities of information: one reading every minute for each vehicle in motion within a 10,000-square-mile area.2 The movement-analysis scenario involves an enemy mobilizing a full division of ground forces (roughly 200 military units and 2000 vehicles) to defend against a possible attack. A simulation of the vehicle movements of this division was developed, the output of which includes reports of the positions of all the vehicles in the division at 1-minute intervals over a 4-day period for 18 hours each day. These military vehicle movements were then interspersed with plausible civilian traffic to add the problem of distinguishing military from nonmilitary traffic. The movement-analysis task is to monitor the movements of the enemy to detect and identify types of military site and convoy.

Because HPKB is not concerned with signal processing, the inputs are not real JSTARS data but are instead generated by a simulator and preprocessed into vehicle tracks. There is no uncertainty in vehicle location and no radar shadowing, and each vehicle is always accurately identified by a unique bumper number. However, vehicle tracks do not precisely identify vehicle type but instead define each vehicle as either light wheeled, heavy wheeled, or tracked. Low-speed and stationary vehicles are not reported.

Vehicle-track data are supplemented by small quantities of high-value intelligence data, including accurate identification of a few key enemy sites, electronic intelligence reports of locations and times at which an enemy radar is turned on, communications intelligence reports that summarize information obtained by monitoring enemy communications, and human intelligence reports that provide detailed information about the numbers and types of vehicles passing a given location. Other inputs include a detailed road network in electronic form and an order of battle that describes the structure and composition of the enemy forces in the scenario region.

Given these inputs, movement analysis comprises the following tasks. First is to distinguish military from nonmilitary traffic. Almost all military traffic travels in convoys, which makes this task fairly straightforward except for very small convoys of two or three vehicles. Second is to identify the sites between which military convoys travel, determine which of these sites are militarily significant, and determine the type of each militarily significant site. Site types include battle positions, command posts, support areas, air-defense sites, artillery sites, and assembly-staging areas. Third is to identify which units (or parts of units) in the enemy order of battle are participating in each military convoy. Fourth is to determine the purpose of each convoy movement. Purposes include reconnaissance, movement of an entire unit toward a battle position, activities by command elements, and support activities. Fifth is to infer the exact types of the vehicles that make up each convoy. About 20 types of military vehicle are distinguished in the enemy order of battle, all of which show up in the scenario data.

To help the technology base and the integration teams develop their systems, a portion of the simulation data was released in advance of the evaluation phase, accompanied by an answer key that supplied model answers for each of the inference tasks listed previously.

Movement analysis is currently carried out manually by human intelligence analysts, who appear to rely on models of enemy behavior at several levels of abstraction. These include models of how different sites or convoys are structured for different purposes and models of military systems such as logistics (supply and resupply). For example, in a logistics model, one might find the following fragment: "Each echelon in a military organization is responsible for resupplying its subordinate echelons. Each echelon, from battalion on up, has a designated area for storing supplies. Supplies are provided by higher echelons and transshipped to lower echelons at these areas." Model fragments such as these are thought to constitute the knowledge of intelligence analysts and, thus, should be the content of HPKB movement-analysis systems. Some such knowledge was elicited from military intelligence analysts during programwide meetings. These same analysts also scripted the simulation scenario.
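A concrete picture of the input may help. The record layout below is a guess at a minimal vehicle-track report consistent with the description above (one reading per minute, a unique bumper number, and a coarse vehicle class); the type and field names are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class VehicleClass(Enum):
    LIGHT_WHEELED = "light wheeled"
    HEAVY_WHEELED = "heavy wheeled"
    TRACKED = "tracked"

@dataclass
class TrackReport:
    """One simulated JSTARS-style reading for a moving vehicle.
    Field names are illustrative, not the challenge problem's format."""
    bumper_number: str          # unique and always accurate in the simulation
    vehicle_class: VehicleClass # coarse class only; exact type must be inferred
    minute: int                 # minutes since the start of the scenario day
    x_km: float                 # position; low-speed and stationary vehicles
    y_km: float                 # simply do not appear in the feed
```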




The Workaround Challenge Problem

The workaround challenge problem supports air-campaign planning by the JFACC and his or her staff. One task for the JFACC is to determine suitable targets for air strikes. Good targets allow one to achieve maximum military effect with minimum risk to friendly forces and minimum loss of life on all sides. Infrastructure often provides such targets: It can be sufficient to destroy supplies at a few key sites or critical nodes in a transportation network, such as bridges along supply routes. However, bridges and other targets can be repaired, and there is little point in destroying a bridge if an available fording site is nearby. If a plan requires an interruption in traffic of several days, and the bridge can be repaired in a few hours, then another target might be more suitable. Target selection, then, requires some reasoning about how an enemy might work around the damage to the target.

The task of the workaround challenge problem is to automatically assess how rapidly and by what method an enemy can reconstitute or bypass damage to a target and, thereby, help air-campaign planners rapidly choose effective targets. The focus of the workaround problem in the first year of HPKB is automatic workaround generation.

The workaround task involves detailed representation of targets and the local terrain around the target and detailed reasoning about actions the enemy can take to reconstitute or bypass this damage. Thus, the inputs to workaround systems include the following elements.

First is a description of a target (for example, a bridge or a tunnel), the damage to it (for example, one span of a bridge is dropped; the bridge and vicinity are mined), and key features of the local terrain (for example, the slope and soil types of a terrain cross section coincident with the road near the bridge, together with the maximum depth and the speed of any river or stream the bridge crosses).

Second is a specific enemy unit or capability to be interdicted, such as a particular armored battalion or supply trucks carrying ammunition.

Third is a time period over which this unit or capability is to be denied access to the targeted route. The presumption is that the enemy will try to repair the damage within this time period; a target is considered to be effective if there appears to be no way for the enemy to make this repair.

Fourth is a detailed description of the enemy resources in the area that could be used to repair the damage. For the most part, repairs to battle damage are carried out by Army engineers; so, this description takes the form of a detailed engineering order of battle.

All inputs are provided in a formal representation language.
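As a rough sketch of how such inputs might be structured (the actual HPKB inputs use a formal representation language whose details are not reproduced here; every type and field name below is invented):

```python
from dataclasses import dataclass, field

@dataclass
class TerrainCrossSection:
    bank_slope_deg: float
    soil_type: str
    river_depth_m: float = 0.0    # zero if no water crossing
    river_speed_m_s: float = 0.0

@dataclass
class Target:
    kind: str                     # e.g., "bridge" or "tunnel"
    damage: list[str]             # e.g., ["one span dropped", "vicinity mined"]
    terrain: TerrainCrossSection

@dataclass
class WorkaroundProblem:
    target: Target
    interdicted_capability: str   # e.g., a particular armored battalion
    denial_period_hours: int      # how long traffic must stay interrupted
    engineering_order_of_battle: list[str] = field(default_factory=list)
```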

The workaround generator is expected to provide three outputs. First is a reconstitution schedule, giving the capacity of the damaged link as a function of time since the damage was inflicted. For example, the workaround generator might conclude that the capacity of the link is zero for the first 48 hours, but thereafter, a temporary bridge will be in place that can sustain a capacity of 170 vehicles an hour. Second is a time line of engineering actions that the enemy might carry out to implement the repair, the time these actions require, and the temporal constraints among them. If there appears to be more than one viable repair strategy, a time line should be provided for each. Third is a set of required assets for each time line of actions, a description of the engineering resources that are used to repair the damage, and pointers to the actions in the time line that utilize these assets. The reconstitution schedule provides the minimal information required to evaluate the suitability of a given target. The time line of actions provides an explanation to justify the reconstitution schedule. The set of required assets is easily derived from the time line of actions and can be used to suggest further targets for preemptive air strikes against the enemy to frustrate its repair efforts.
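A reconstitution schedule is naturally a step function of time. A minimal sketch, using the 48-hour, 170-vehicles-per-hour example above (the representation is invented, not HPKB's):

```python
# A reconstitution schedule as a step function: each entry gives the hour
# at which a new capacity (vehicles per hour) takes effect.
schedule = [(0, 0), (48, 170)]   # zero capacity, then a temporary bridge

def capacity_at(schedule: list[tuple[int, int]], hour: float) -> int:
    """Capacity of the damaged link `hour` hours after the strike."""
    current = 0
    for start, vehicles_per_hour in schedule:
        if hour >= start:
            current = vehicles_per_hour
    return current

assert capacity_at(schedule, 12) == 0
assert capacity_at(schedule, 60) == 170
```

On this representation, a target is effective for a given denial period exactly when the capacity stays at zero throughout that period.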

A training data set was provided to help developers build their systems. It supplied inputs and outputs for several sample problems, together with detailed descriptions of the calculations carried out to compute action durations; lists of simplifying assumptions made to facilitate these calculations; and pointers to text sources for information on engineering resources and their use (mainly Army field manuals available on the World Wide Web).

Workaround generation requires detailed knowledge about the capabilities of the enemy's engineering equipment and how it is typically used by enemy forces. For example, repairing damage to a bridge typically involves mobile bridging equipment, such as armored vehicle-launched bridges (AVLBs), medium girder bridges, Bailey bridges, or float bridges such as ribbon bridges or M4T6 bridges, together with a range of earthmoving equipment such as bulldozers. Each kind of mobile bridge takes a characteristic amount of time to deploy, requires different kinds of bank preparation, and is owned by different echelons in the military hierarchy, all of which




affect the time it takes to bring the bridge to a damage site and effect a repair. Because HPKB operates in an entirely unclassified environment, U.S. engineering resources and doctrine were used throughout. Information from Army field manuals was supplemented by a series of programwide meetings with an Army combat engineer, who also helped construct sample problems and solutions.

Integrated Systems

The challenge problems are solved by integrated systems fielded by integration teams led by Teknowledge and SAIC. Teknowledge favors a centralized architecture that contains a large commonsense ontology (CYC); SAIC has a distributed architecture that relies on sharing specialized domain ontologies and knowledge bases, including a large upper-level ontology based on the merging of CYC, SENSUS, and other knowledge bases.

Teknowledge Integration

The Teknowledge integration team comprises Teknowledge, Cycorp, and Kestrel Institute. Its focus is on semantic integration and the creation of massive amounts of knowledge.

Semantic Integration

Three issues make software integration difficult. Transport issues concern mechanisms to get data from one process or machine to another. Solutions include sockets, remote-method invocation (RMI), and CORBA. Syntactic issues concern how to convert number formats, syntactic sugar, and the labels of data. The more challenging issues concern semantic integration: To integrate elements properly, one must understand the meaning of each. The database community has addressed this issue (Wiederhold 1996); it is even more pressing in knowledge-based systems.

The current state of practice in software integration consists largely of interfacing pairs of systems as needed. Pairwise integration of this kind does not scale up, unanticipated uses are hard to cover later, and chains of integrated systems at best evolve into stovepipe systems. Each integration is only as general as it needs to be to solve the problem at hand.

Some success has been achieved in low-level integration and reuse; for example, systems that share scientific subroutine libraries or graphics packages are often forced into similar representational choices for low-level data. DARPA has invested in early efforts to create reuse libraries for integrating large systems at higher levels. Some effort has gone into expressing a generic semantics of plans in an

object-oriented format (Pease and Carrico 1997a, 1997b), and applications of this generic semantics to domain-specific tasks are promising (Pease and Albericci 1998). The development of ontologies for integrating manufacturing planning applications (Tate 1998) and work flow (Lee et al. 1996) is ongoing.

Another option for semantic integration is software mediation (Park, Gennari, and Musen 1997). Software mediation can be seen as a variant on pairwise integration, but because integration is done by knowledge-based means, one has an explicit expression of the semantics of the conversion. Researchers at Kestrel Institute have successfully defined formal specifications for data and used these theories to integrate formally specified software. In addition, researchers at Cycorp have successfully applied CYC to the integration of multiple databases.

The Teknowledge approach to integration is to share knowledge among applications and create new knowledge to support the challenge problems. Teknowledge is defining formal semantics for the inputs and outputs of each application and the information in the challenge problems.

Many concepts defy simple definitions. Although there has been much success in defining the semantics of mathematical concepts, it is harder to be precise about the semantics of the concepts people use every day. These concepts seem to acquire meaning through their associations with other concepts, their use in situations and communication, and their relations to instances. To give the concepts in our integrated system real meaning, we must provide a rich set of associations, which requires an extremely large knowledge base. CYC offers just such a knowledge base.

CYC (Lenat 1995; Lenat and Guha 1990) consists of an immense, multicontextual knowledge base; an efficient inference engine; and associated tools and interfaces for acquiring, browsing, editing, and combining knowledge. Its premise is that knowledge-based software will be less brittle if and only if it has access to a foundation of basic commonsense knowledge. This semantic substratum of terms, rules, and relations enables application programs to cope with unforeseen circumstances and situations.

The CYC knowledge base represents millions of hand-crafted axioms entered during the 13 years since CYC's inception. Through careful policing and generalizing, there are now slightly fewer than 1 million axioms in the knowledge base, interrelating roughly 50,000



atomic terms. Fewer than two percent of these axioms represent simple facts about proper nouns of the sort one might find in an almanac. Most embody general consensus information about the concepts. For example, one axiom says one cannot perform volitional actions while one sleeps, another says one cannot be in two places at once, and another says you must be at the same place as a tool to use it. The knowledge base spans human capabilities and limitations, including information on emotions, beliefs, expectations, dreads, and goals; common everyday objects, processes, and situations; and the physical universe, including such phenomena as time, space, causality, and motion.

The CYC inference engine comprises an epistemological and a heuristic level. The epistemological level is an expressive nth-order logical language with clean formal semantics. The heuristic level is a set of some three dozen special-purpose modules, each of which contains its own algorithms and data structures and can recognize and handle some commonly occurring sorts of inference. For example, one heuristic-level module handles temporal reasoning efficiently by converting temporal relations into a before-and-after graph and then doing graph searching rather than theorem proving to derive an answer. A truth maintenance system and an argumentation-based explanation and justification system are tightly integrated into the system and are efficient enough to be in operation at all times. In addition to these inference engines, CYC includes numerous browsers, editors, and consistency checkers. A rich interface has been defined.
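The idea behind that temporal module can be shown with a toy sketch. This illustrates the general technique (precedence assertions become graph edges, and a "does A happen before C?" query becomes a reachability search), not Cycorp's actual code:

```python
from collections import defaultdict

# Before-and-after graph: an edge a -> b asserts "a happens before b".
before = defaultdict(set)
for a, b in [("strike", "repair"), ("repair", "reopening")]:
    before[a].add(b)

def happens_before(a: str, b: str) -> bool:
    """Answer a temporal query by graph search instead of theorem proving."""
    stack, seen = [a], set()
    while stack:
        node = stack.pop()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(before[node])
    return False

assert happens_before("strike", "reopening")   # derived transitively
```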

Crisis-Management Integration

The crisis-management challenge problem involves answering test questions presented in a structured grammar. The first step in answering a test question is to convert it to a form that CYC can reason with, a declarative decision tree. When the tree is applied to the test question input, a CYC query is generated and sent to CYC.

Answering the challenge problem questions takes a great deal of knowledge. For the first year's challenge problem alone, the Cycorp and Teknowledge team added some 8,000 concepts and 80,000 assertions to CYC. To meet the needs of this challenge problem, the team created significant amounts of new knowledge, some developed by collaborators and merged into CYC, some added by automated processes.

The Teknowledge integrated system includes two natural language components: START and TextWise. The START system was created by Boris Katz (1997) and his group at MIT. For each of the crisis-management questions, Teknowledge has developed a template into which user-specified parameters can be inserted. START parses English queries for a few of the crisis-management questions to fill in these templates. Each filled template is a legal CYC query. TextWise Corporation has been developing natural language information-retrieval software primarily for news articles (Liddy, Paik, and McKenna 1995). Teknowledge intends to use the TextWise knowledge base information tools (KNOW-IT) to supply to CYC many instances of facts discovered from news stories. The system can parse English text and return a series of binary relations that express the content of the sentences. There are several dozen relation types, and the constants that instantiate each relation are WORDNET synset mappings (Miller et al. 1993). Each of the concepts has been mapped to a CYC expression, and a portion of WORDNET has been mapped to CYC. For those synsets not in CYC, the WORDNET hyponym links are traversed until a mapped CYC term is found.
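A sketch of that fallback mapping is below. The synset names, link table, and CYC constants are invented stand-ins (WORDNET's real interface differs); the point is the traversal from an unmapped synset along taxonomy links until a mapped one is reached:

```python
# Map a WORDNET synset to a CYC term, falling back along taxonomy links
# until a synset with a known mapping is reached. Tables are illustrative.
synset_to_cyc = {"vehicle.n.01": "#$Vehicle"}
linked_synsets = {"tank.n.04": ["military_vehicle.n.01"],
                  "military_vehicle.n.01": ["vehicle.n.01"]}

def cyc_term_for(synset: str) -> str | None:
    """Breadth-first walk over taxonomy links to a CYC-mapped synset."""
    frontier, seen = [synset], set()
    while frontier:
        s = frontier.pop(0)
        if s in synset_to_cyc:
            return synset_to_cyc[s]
        if s not in seen:
            seen.add(s)
            frontier.extend(linked_synsets.get(s, []))
    return None

assert cyc_term_for("tank.n.04") == "#$Vehicle"
```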

Battlespace Integration

Teknowledge supported the movement-analysis and workaround problems.

Movement analysis: Several movement-analysis systems were to be integrated, and much preliminary integration work was done. Ultimately, the time pressure of the challenge problem evaluation precluded a full integration. The MIT and UMass movement-analysis systems are described briefly here; the SMI and SRI systems are described in the section entitled SAIC Integrated System.

The MIT MAITA system provides tools for constructing and controlling networks of distributed-monitoring processes. These tools provide access to large knowledge bases of monitoring methods, organized around the hierarchies of tasks performed, knowledge used, contexts of application, the alerting of utility models, and other dimensions. Individual monitoring processes can also make use of knowledge bases representing commonsense or expert knowledge in conducting their reasoning or reporting their findings. MIT built monitoring processes for sites and convoys with these tools.

The UMass group tried to identify convoys and sites with very simple rules. Rules were developed for three site types: (1) battle positions, (2) command posts, and (3) assembly-staging areas. The convoy detector simply looked for vehicles traveling at fixed distances from one another. Initially, UMass was going to recognize convoys from their dynamics, in which the distances between vehicles fluctuate in a characteristic way, but in the simulated




data, the distances between vehicles remained fixed. UMass also intended to detect sites by the dynamics of vehicle movements between them, but no significant dynamic patterns could be found in the movement data.
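The UMass convoy rule is easy to picture. A minimal sketch, assuming track snapshots like those described earlier (the tolerance value is invented): a group of vehicles is a convoy if the gaps between successive vehicles stay essentially fixed over time, which holds exactly in the simulated data.

```python
def is_convoy(snapshots: list[list[float]], tolerance_km: float = 0.05) -> bool:
    """UMass-style test: the gaps between successive vehicles stay
    (nearly) fixed over time. Each snapshot holds along-road positions,
    in kilometers, one entry per vehicle, in road order."""
    def gaps(positions: list[float]) -> list[float]:
        return [b - a for a, b in zip(positions, positions[1:])]

    reference = gaps(snapshots[0])
    return all(
        all(abs(g - r) <= tolerance_km for g, r in zip(gaps(snap), reference))
        for snap in snapshots[1:]
    )

assert is_convoy([[0.0, 0.1, 0.2], [1.0, 1.1, 1.2]])   # gaps unchanged
```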

Workarounds: Teknowledge developed two workaround integrations, one an internal Teknowledge system, the other from AIAI at the University of Edinburgh.

Teknowledge developed a planning tool based on CYC, essentially a wrapper around CYC's existing knowledge base and inference engine. A plan is a proof that there is a path from the final goal to the initial situation through a partially ordered set of actions. The rules in the knowledge base driving the planner are rules about action preconditions and about which actions can bring about a certain state of affairs. There is no explicit temporal reasoning; the partial order of temporal precedence between actions is established on the basis of the rules about preconditions and effects.

The planner is a new kind of inference engine, performing its own search but in a much smaller search space. However, each step in the search involves interaction with the existing inference engine by hypothesizing actions and microtheories and doing asks and asserts in these microtheories. This hypothesizing and asserting on the fly in effect amounts to dynamically updating the knowledge base in the course of inference; this capability is new for the CYC inference engine.
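The plan-as-proof idea can be sketched in a few lines: regress from the goal through action preconditions until everything is supported by the initial situation. The action definitions below are invented; the real planner does this search inside CYC's microtheories rather than over Python dictionaries.

```python
# Each action maps a set of preconditions to one effect. Regressing from
# the goal to the initial situation yields a partially ordered plan.
ACTIONS = {
    "install_bridge": ({"banks_prepared", "bridge_on_site"}, "gap_crossed"),
    "prepare_banks": ({"dozer_on_site"}, "banks_prepared"),
}
INITIAL = {"dozer_on_site", "bridge_on_site"}

def prove(goal: str, plan: list[str]) -> bool:
    """Search for a proof that `goal` is reachable from INITIAL."""
    if goal in INITIAL:
        return True
    for name, (preconditions, effect) in ACTIONS.items():
        if effect == goal and all(prove(p, plan) for p in preconditions):
            plan.append(name)   # action justified: record it in the plan
            return True
    return False

plan: list[str] = []
assert prove("gap_crossed", plan)   # plan: ['prepare_banks', 'install_bridge']
```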

Consistent with the goals of HPKB, the Teknowledge workaround planner reused CYC's knowledge, although it was not knowledge specific to workarounds. In fact, CYC had never been the basis of a planner before, so even stating things in terms of an action's preconditions was new. What CYC provided, however, was a rich basis on which to build workaround knowledge. For example, the Teknowledge team needed to write only one rule to state that to use something as a device, you must have control over that device, and this rule covered the cases of using an M88 to clear rubble, a mine plow to breach a minefield, a bulldozer to cut into a bank or narrow the gap, and so on. The reason one rule can cover so many cases is that clearing rubble, demining an area, narrowing a gap, and cutting into a bank are all specializations of IntrinsicStateChangeEvent, an extant part of the CYC ontology.

The AIAI workaround planner was also implemented in CYC and took data from Teknowledge's FIRE&ISE-TO-MELD translator as its input. The central idea was to use the scriptlike structure of workaround plans to guide the reasoning process. For this reason, a hierarchical task network (HTN) approach was taken. A planning-specific ontology was defined within the larger CYC ontology, and planning rules only referenced concepts within this more constrained context. The planning application was essentially embedded in CYC.

CYC had to be extended to represent composite actions that have several alternative decompositions and complex preconditions and effects. Although it is not a commonsense approach, AIAI decided to explore HTN planning because it appeared suitable for the workaround domain. It was possible to represent actions, their conditions and effects, the plan-node network, and plan resources in a relational style. The structure of a plan was implicitly represented in the proof that the corresponding composite action was a relation between particular sets of conditions and effects. Once proved, action relations are retained by CYC and are potentially reusable. An advantage of implementing the AIAI planner in CYC was the ability to remove brittleness from the planner-input knowledge format; for example, it was not necessary to account for all the possible permutations of argument order in predicates such as bordersOn and between.
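An HTN planner of this flavor expands composite tasks through alternative decompositions until only primitive actions remain. A toy sketch, with task names and decompositions invented for illustration:

```python
# Alternative decompositions for composite tasks; names not in METHODS
# are primitive actions that need no further expansion.
METHODS = {
    "restore_crossing": [["emplace_avlb"],
                         ["prepare_banks", "install_ribbon_bridge"]],
}

def expand(task: str) -> list[str] | None:
    """Return the first decomposition into primitive actions, or None."""
    if task not in METHODS:          # primitive action
        return [task]
    for decomposition in METHODS[task]:
        plan = []
        for subtask in decomposition:
            steps = expand(subtask)
            if steps is None:
                break
            plan.extend(steps)
        else:
            return plan              # every subtask expanded successfully
    return None

print(expand("restore_crossing"))    # ['emplace_avlb']
```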

SAIC Integrated System

SAIC built an HPKB integrated knowledge environment (HIKE) to support both crisis-management and battlespace challenge problems. The architecture of HIKE for crisis management is shown in figure 4. For battlespace, the architecture is similar in that it is distributed and relies on the open knowledge base connectivity (OKBC) protocol, but of course, the components integrated by the battlespace architecture are different. HIKE's goals are to address the distributed communications and interoperability requirements among the HPKB technology components (knowledge servers, knowledge-acquisition tools, question-answering systems, problem solvers, process monitors, and so on) and provide a graphic user interface (GUI) tailored to the end users of the HPKB environment.

HIKE provides a distributed computing infrastructure that addresses two types of communications needs. First are input and output data-transportation and software connectivities. These include connections between the HIKE server and technology components, connections between components, and connections between servers. HIKE encapsulates information content and data transportation through JAVA objects, hypertext transfer protocol (HTTP), remote-method invocation (JAVA RMI), and database access (JDBC).




Second, HIKE provides for knowledge-content assertion and distribution and for query requests to knowledge services.

The OKBC protocol proved essential. SRI used it to interface the theorem prover SNARK to an OKBC server storing the Central Intelligence Agency (CIA) World Fact Book knowledge base. Because this knowledge base is large, SRI did not want to incorporate it into SNARK but instead used the procedural attachment feature of SNARK to look up facts that were available only in the World Fact Book. MIT's START system used OKBC to connect to SRI's OCELOT-SNARK OKBC server. This connection will eventually give users the ability to pose questions in English, which are then transformed to a formal representation by START and shipped to SNARK using OKBC; the result is returned using OKBC. ISI built an OKBC server for their LOOM system for GMU. SAIC built a front end to the OKBC server for LOOM that was extensively used by the members of the battlespace challenge problem team.
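Procedural attachment lets a reasoner delegate certain predicates to external code instead of searching its own axioms. A minimal sketch of the idea (this is not SNARK's or OKBC's actual API; all names are invented):

```python
# When the prover encounters an attached predicate, it calls out to an
# external source (here, a stand-in for the World Fact Book server)
# rather than searching its own knowledge base.
FACT_BOOK = {("capital", "Iran"): "Tehran"}

ATTACHMENTS = {
    "capital": lambda country: FACT_BOOK.get(("capital", country)),
}

def lookup(predicate: str, *args: str):
    """Resolve a literal via procedural attachment if one is registered."""
    handler = ATTACHMENTS.get(predicate)
    return handler(*args) if handler else None

print(lookup("capital", "Iran"))   # delegated to the external source
```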

With OKBC and other methods, the HIKE infrastructure permits the integration of new technology components (either clients or servers) in the integrated end-to-end HPKB system without introducing major changes, provided that the new components adhere to the specified protocols.

Crisis-Management Integration

The SAIC crisis-management architecture is focused around a central OKBC bus, as shown in figure 4. The technology components provide user interfaces, question answering, and knowledge services. Some components have overlapping roles. For example, MIT's START system serves both as a user interface and a question-answering component.


Figure 4. SAIC Crisis-Management Challenge Problem Architecture. (The diagram shows HIKE clients and servers, user interfaces, question-answering components such as START, SNARK, ATP, and SPOOK, and knowledge services such as Ontolingua, the GKB-Editor, WebKB, and KNOW-IT, connected through an OKBC (Open Knowledge Base Connectivity) bus and fed by training data, electronic archival sources, the web, and real-time news feeds.)


Similarly, CMU's WEBKB supports both question answering and knowledge services.

HIKE provides a form-based GUI with which users can construct queries with pull-down menus. Query-construction templates correspond to the templates defined in the crisis-management challenge problem specification. Questions also can be entered in natural language. START and the TextWise component accept natural language queries and then attempt to answer the questions. To answer questions that involve more complex types of reasoning, START generates a formal representation of the query and passes it to one of the theorem provers.

The Stanford University Knowledge Systems Laboratory ONTOLINGUA, SRI International's graphic knowledge base (GKB) editor, WEBKB, and TextWise provide the knowledge service components. The GKB editor is a graphic tool for browsing and editing large knowledge bases, used primarily for manual knowledge acquisition. WEBKB supports semiautomatic knowledge acquisition. Given some training data and an ontology as input, a web spider searches in a directed manner and populates instances of classes and relations defined in the ontology. Probabilistic rules are also extracted. TextWise extracts information from text and newswire feeds, converting them into knowledge interchange format (KIF) triples, which are then loaded into ONTOLINGUA. ONTOLINGUA is SAIC's central knowledge server and information repository for the crisis-management challenge problem. ONTOLINGUA supports KIF as well as compositional modeling language (CML). Flow models developed by Northwestern University (NWU) answer challenge problem questions related to world oil-transportation networks and reside within ONTOLINGUA. Stanford's system for probabilistic object-oriented knowledge (SPOOK) provides a language for class frames to be annotated with probabilistic information, representing uncertainty about the properties of instances in a class. SPOOK is capable of reasoning with the probabilistic information based on Bayesian networks.

Question answering is implemented in several ways. SRI International's SNARK and Stanford's abstract theorem prover (ATP) are first-order theorem provers. WEBKB answers questions based on the information it gathers. Question answering is also accomplished by START and TextWise taking a query in English as input and using information retrieval to extract the answers from text-based sources (such as the web and newswire feeds).

The guiding philosophy during knowledge base development for crisis management was to reuse knowledge whenever it made sense.

The SAIC team reused three knowledge bases: (1) the HPKB upper-level ontology developed by Cycorp, (2) the World Fact Book knowledge base from the CIA, and (3) the Units and Measures Ontology from Stanford. Reusing the upper-level ontology required translation, comprehension, and reformulation. The ontology was released in MELD (a language used by Cycorp) and was not directly readable by the SAIC system. In conjunction with Stanford, SRI developed a translator to load the upper-level ontology into any OKBC-compliant server. Once it was loaded into the OCELOT server, the GKB editor was used to comprehend the upper ontology. The graphic nature of the GKB editor illuminated the interrelationships between classes and predicates of the upper-level ontology. Because the upper-level ontology represents functional relationships as predicates but SNARK reasons efficiently with functions, it was necessary to reformulate the ontology to use functions whenever a functional relationship existed.

Battlespace Integration

The distributed HIKE infrastructure is well suited to support an integrated battlespace challenge problem as it was originally designed: a single information system for movement analysis, trafficability, and workaround reasoning. However, the trafficability problem (establishing routes for various kinds of vehicles, given their characteristics) was dropped, and the integration of the other problems was delayed. The components that solved these problems are described briefly later.

Movement analysis: The movement-analysis problem is solved by MIT's monitoring, analysis, and interpretation tools arsenal (MAITA); Stanford University's Section on Medical Informatics (SMI) problem-solving methods; and SRI International's Bayesian networks. The MIT effort was described briefly in the section entitled Teknowledge Integration. Here, we focus on the SMI and SRI movement-analysis systems.

For scalability, SMI adopted a three-layered approach to the challenge problem. The first layer consisted primarily of simple, context-free data processing that attempted to find important preliminary abstractions in the data set. The most important of these were traffic centers (locations that were either the starting or stopping points for a significant number of vehicles) and convoy segments (a number of vehicles that depart from the same traffic center at roughly the same time, going in roughly the same direction). Spotting these abstractions required setting a number of parameters (for example, how big a traffic center is). Once trained, these first-layer algorithms are linear in


For scalability, SMI adopted a three-layered approach to the challenge problem: The first layer consisted primarily of simple, context-free data processing that attempted to find important preliminary abstractions in the data set. The most important of these were traffic centers (locations that were either the starting or stopping points for a significant number of vehicles) and convoy segments (a number of vehicles that depart from the same traffic center at roughly the same time, going in roughly the same direction). Spotting these abstractions required setting a number of parameters (for example, how big a traffic center is). Once trained, these first-layer algorithms are linear in the size of the data set and enabled SMI to use knowledge-intensive techniques on the resulting (much smaller) set of data abstractions.
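To make the first layer concrete, here is a minimal sketch of convoy-segment detection. The record fields, thresholds, and the single-sweep grouping are all invented for illustration; they are not SMI's actual parameters or algorithm.

# Illustrative sketch: group vehicle departures into convoy segments.
# Thresholds and record fields are hypothetical stand-ins for the
# parameters the text says SMI had to set.
from dataclasses import dataclass

@dataclass
class Departure:
    vehicle_id: str
    center_id: str    # traffic center the vehicle departs from
    time: float       # departure time in minutes
    heading: float    # initial heading in degrees

TIME_WINDOW = 15.0    # "roughly the same time" (minutes, invented)
HEADING_TOL = 20.0    # "roughly the same direction" (degrees, invented)

def heading_gap(a: float, b: float) -> float:
    """Smallest angular difference between two headings."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def convoy_segments(departures, min_vehicles=4):
    """Sweep time-sorted departures per traffic center, starting a new
    segment whenever the time gap or heading change is too large."""
    segments, current = [], []
    for dep in sorted(departures, key=lambda d: (d.center_id, d.time)):
        if current and (dep.center_id != current[-1].center_id
                        or dep.time - current[-1].time > TIME_WINDOW
                        or heading_gap(dep.heading, current[0].heading) > HEADING_TOL):
            segments.append(current)
            current = []
        current.append(dep)
    if current:
        segments.append(current)
    return [s for s in segments if len(s) >= min_vehicles]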

The second layer was a repair layer, which used knowledge of typical convoy behaviors and locations on the battlespace to construct a map of militarily significant traffic and traffic centers. The end result was a network of traffic centers connected by traffic. Three main tasks remained: (1) classify the traffic centers, (2) figure out what the convoys are doing, and (3) identify which units are involved. SMI iteratively answered these questions by using repeated layers of heuristic classification and constraint satisfaction. The heuristic-classification components operated independently of the network, using known (and deduced) facts about single convoys or traffic centers. Consider the following rule for trying to identify a main supply brigade (MSB) site (paraphrased into English):

If we have a current site which is unclassified
and it's in the Division support area,
and the traffic is high enough
and the traffic is balanced
and the site is persistent with no major deployments emanating from it
then it's probably an MSB
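Rules of this kind are straightforward to operationalize. The sketch below is a loose Python transliteration; the Site fields and the thresholds behind "high enough" and "balanced" are invented stand-ins for SMI's abstractions, not values from the actual system.

# Loose, illustrative transliteration of the MSB rule above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Site:
    classification: Optional[str]   # None while still unclassified
    in_division_support_area: bool
    inbound: int                    # convoys arriving at the site
    outbound: int                   # convoys departing the site
    persistent: bool                # present throughout the observation period
    major_deployments: bool         # major deployments emanating from it

TRAFFIC_THRESHOLD = 50   # invented stand-in for "high enough"
BALANCE_TOL = 0.2        # invented: inbound/outbound within 20 percent of total

def probably_msb(site: Site) -> bool:
    total = site.inbound + site.outbound
    balanced = total > 0 and abs(site.inbound - site.outbound) <= BALANCE_TOL * total
    return (site.classification is None
            and site.in_division_support_area
            and total >= TRAFFIC_THRESHOLD
            and balanced
            and site.persistent
            and not site.major_deployments)

In the full system, verdicts from rules like this would then be propagated through the traffic network by the constraint-satisfaction pass described next.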

SMI used similar rules for the constraint-satisfaction component of its system, allowing information to propagate through the network in a manner similar to Waltz's (1975) well-known constraint-satisfaction algorithm for edge labeling.

The goal of the SRI group was to induce a knowledge base to characterize and identify types of site such as command posts and battle positions. Its approach was to induce a Bayesian classifier and use a generative model approach, producing a Bayesian network that could serve as a knowledge base. This modeling required transforming raw vehicle tracks into features (for example, the frequency of certain vehicles at sites, the number of stops) that could be used to predict sites. Thus, it was also necessary to have hypothetical sites to test. SRI relied on SMI to provide hypothetical sites, and it also used some of the features that SMI computed. As a classifier, SRI used tree-augmented naive (TAN) Bayes (Friedman, Geiger, and Goldszmidt 1997).
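As a rough picture of the classifier side, the sketch below trains a naive Bayes model over discretized site features. The feature names and smoothing are illustrative; full TAN (Friedman, Geiger, and Goldszmidt 1997) additionally gives each feature one feature-parent chosen by conditional mutual information, which this sketch omits.

# Minimal naive Bayes sketch over discretized site features.
# Feature names and values are invented for illustration.
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (features: dict[str, str], site_type: str)."""
    prior = Counter(label for _, label in examples)
    cond = defaultdict(Counter)   # (feature, label) -> value counts
    values = defaultdict(set)     # feature -> observed values
    for feats, label in examples:
        for f, v in feats.items():
            cond[(f, label)][v] += 1
            values[f].add(v)
    return prior, cond, values

def classify(model, feats):
    prior, cond, values = model
    total = sum(prior.values())
    best, best_p = None, -1.0
    for label, n in prior.items():
        p = n / total
        for f, v in feats.items():
            counts = cond[(f, label)]
            # add-one (Laplace) smoothing over the observed values
            p *= (counts[v] + 1) / (sum(counts.values()) + len(values[f]))
        if p > best_p:
            best, best_p = label, p
    return best

model = train([({"halts": "many", "vehicles": "heavy"}, "command-post"),
               ({"halts": "few", "vehicles": "heavy"}, "battle-position")])
print(classify(model, {"halts": "many", "vehicles": "heavy"}))  # command-post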

    Workarounds: The SAIC team integratedtwo approaches to workaround generation,one developed by USC-ISI, the other by GMU.

ISI developed course-of-action generation problem solvers to create alternative solutions to workaround problems. In fact, two alternative course-of-action generation problem solvers were developed. One is a novel planner whose knowledge base is represented in the ontologies, including its operators, state descriptions, and problem-specific information. It uses a novel partial-match capability developed in LOOM (MacGregor 1991). The other is based on a state-of-the-art planner (Veloso et al. 1995). Each solution lists several engineering actions for this workaround (for example, deslope the banks of the river, install a temporary bridge), includes information about the sources used (for example, what kind of earthmoving equipment or bridge is used), and asserts temporal constraints among the individual actions to indicate which can be executed in parallel. A temporal estimation-assessment problem solver evaluates each of the alternatives and selects one as the most likely choice for an enemy workaround. This problem solver was developed in EXPECT (Swartout and Gil 1995; Gil 1994).

Several general battlespace ontologies (for example, military units, vehicles), anchored on the HPKB upper ontology, were used and augmented with ontologies needed to reason about workarounds (for example, engineering equipment). Besides these ontologies, the knowledge bases used included a number of problem-solving methods to represent knowledge about how to solve the task. Both ontologies and problem-solving knowledge were used by two main problem solvers.

EXPECT's knowledge-acquisition tools were used throughout the evaluation to detect missing knowledge. EXPECT uses problem-solving knowledge and ontologies to analyze which information is needed to solve the task. This capability allows EXPECT to alert a user when there is missing knowledge about a problem (for example, unspecified bridge lengths) or a situation. It also helps debug and refine ontologies by detecting missing axioms and overgeneral definitions.
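One can picture this capability as a check of each problem-solving step's information needs against the knowledge base. The sketch below uses invented method and fact names, not EXPECT's actual representations.

# Illustrative sketch of missing-knowledge detection in the spirit of
# EXPECT: each problem-solving method declares the facts it needs, and
# the analyzer reports any need the knowledge base cannot satisfy.
# Method and fact names are hypothetical.

methods = {
    "estimate-bridging-time": ["bridge.length", "bridge.width", "equipment.available"],
    "plan-river-crossing":    ["river.depth", "bank.slope"],
}

knowledge_base = {"bridge.width", "equipment.available", "river.depth", "bank.slope"}

def missing_knowledge(task_methods, kb):
    gaps = {}
    for method, needs in task_methods.items():
        absent = [n for n in needs if n not in kb]
        if absent:
            gaps[method] = absent
    return gaps

print(missing_knowledge(methods, knowledge_base))
# {'estimate-bridging-time': ['bridge.length']}  e.g., an unspecified bridge length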

GMU developed the DISCIPLE98 system. DISCIPLE is an apprenticeship multistrategy learning system that learns from examples, from explanations, and by analogy and can be taught by an expert how to perform domain-specific tasks through examples and explanations in a way that resembles how experts teach apprentices (Tecuci 1998). For the workaround domain, DISCIPLE was extended into a baseline-integrated system that creates an ontology by acquiring concepts from a domain expert or importing them (through OKBC) from shared ontologies. It learns task-decomposition rules from a domain expert and uses this knowledge to solve workaround problems through hierarchical nonlinear planning.


First, with DISCIPLE's ontology-building tools, a domain expert assisted by a knowledge engineer built the object ontology from several sources, including experts' manuals, Alphatech's FIRE&ISE document, and ISI's LOOM ontology. Second, a task taxonomy was defined by refining the task taxonomy provided by Alphatech. This taxonomy indicates principled decompositions of generic workaround tasks into subtasks but does not indicate the conditions under which such decompositions should be performed. Third, the examples of hierarchical workaround plans provided by Alphatech were used to teach DISCIPLE. Each such plan provided DISCIPLE with specific examples of decompositions of tasks into subtasks, and the expert guided DISCIPLE to understand why each task decomposition was appropriate in a particular situation. From these examples and the explanations of why they are appropriate in the given situations, DISCIPLE learned general task-decomposition rules. After a knowledge base consisting of an object ontology and task-decomposition rules was built, the hierarchical nonlinear planner of DISCIPLE was used to automatically generate workaround plans for new workaround problems.
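The flavor of applying learned task-decomposition rules in a hierarchical planner can be sketched as follows; the task names, conditions, and rule structure are invented for illustration and are not actual DISCIPLE output.

# Illustrative sketch: expanding a task into primitive actions with
# condition-guarded decomposition rules, in the spirit of hierarchical
# nonlinear planning. Rules and task names are hypothetical.

rules = [
    {   # the kind of rule that might be learned from an expert-explained plan
        "task": "restore-river-crossing",
        "condition": lambda s: s.get("bridge-destroyed") and s.get("fording-possible"),
        "subtasks": ["reduce-bank-slope", "ford-river"],
    },
    {
        "task": "restore-river-crossing",
        "condition": lambda s: s.get("bridge-destroyed") and not s.get("fording-possible"),
        "subtasks": ["install-temporary-bridge"],
    },
]

def decompose(task, state, primitive):
    """Recursively expand a task into primitive actions."""
    if task in primitive:
        return [task]
    for rule in rules:
        if rule["task"] == task and rule["condition"](state):
            plan = []
            for sub in rule["subtasks"]:
                plan.extend(decompose(sub, state, primitive))
            return plan
    raise ValueError(f"no applicable rule for {task}")

state = {"bridge-destroyed": True, "fording-possible": True}
print(decompose("restore-river-crossing", state,
                primitive={"reduce-bank-slope", "ford-river", "install-temporary-bridge"}))
# ['reduce-bank-slope', 'ford-river']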


    Evaluation

The SAIC and Teknowledge integrated systems for crisis management, movement analysis, and workarounds were tested in an extensive study in June 1998. The study followed a two-phase, test-retest schedule. In the first phase, the systems were tested on problems similar to those used for system development, but in the second, the problems required significant modifications to the systems. Within each phase, the systems were tested and retested on the same problems. The test at the beginning of each phase established a baseline level of performance, and the test at the end measured improvement during the phase.

We claim that HPKB technology facilitates rapid modification of knowledge-based systems. This claim was tested in both phases of the experiment because each phase allows time to improve performance on test problems. Phase 2 provides a more stringent test: Only some of the phase 2 problems can be solved by the phase 1 systems, so the systems were expected to perform poorly in the test at the beginning of phase 2. The improvement in performance on these problems during phase 2 is a direct measure of how well HPKB technology facilitates knowledge capture, representation, merging, and modification.

Each challenge problem is evaluated by different metrics. The test items for crisis management were questions, and the test was similar to an exam. Overall competence is a function of the number of questions answered correctly, but the crisis-management systems are also expected to show their work and provide justifications (including sources) for their answers. Examples of questions, answers, and justifications for crisis management are shown in the section entitled Crisis-Management Challenge Problem.

Performance metrics for the movement-analysis problem are related to recall and precision. The basic problem is to identify sites, vehicles, and purposes given vehicle track data; so, performance is a function of how many of these entities are correctly identified and how many incorrect identifications are made. In general, identifications can be marked down on three dimensions: First, the identified entity can be more or less like the actual entity; second, the location of the identified entity can be displaced from the actual entity's true location; and third, the identification can be more or less timely.

The workaround problem involves generating workarounds to military actions such as bombing a bridge. Here, the criteria for successful performance include coverage (the generation of all possible workarounds), appropriateness (the generation of workarounds appropriate given the action), specificity (the exact implementation of the workaround), and accuracy of timing inferences (the length each step in the workaround takes to implement).

Performance evaluation, although essential, tells us relatively little about the HPKB integrated systems, still less about the component technologies. We also want to know why the systems perform well or poorly. Answering this question requires credit assignment because the systems comprise many technologies. We also want to gather evidence pertinent to several important, general claims. One claim is that HPKB facilitates rapid construction of knowledge-based systems because ontologies and knowledge bases can be reused. The challenge problems by design involve broad, relatively shallow knowledge in the case of crisis management and deep, fairly specific knowledge in the battlespace problems. It is unclear which kind of problem most favors the reuse claim and why. We are developing analytic models of reuse. Although the predictions of these models will not be directly tested in the first year's evaluation, we will gather data to calibrate these models for a later evaluation.


Results of the Challenge Problem Evaluation

We present the results of the crisis-management evaluation first, followed by the results of the battlespace evaluation.

    Crisis Management

The evaluation of the SAIC and Teknowledge crisis-management systems involved 7 trials or batches of roughly 110 questions. Thus, more than 1500 answers were manually graded by the challenge problem developer, IET, and subject-matter experts at PSR on criteria ranging from correctness to completeness of source material to the quality of the representation of the question. Each question in a batch was posed in English accompanied by the syntax of the corresponding parameterized question (figure 3). The crisis-management systems were supposed to translate these questions into an internal representation, MELD for the Teknowledge system and KIF for the SAIC system. The MELD translator was operational for all the trials; the KIF translator was used to a limited extent on later trials.

The first trial involved testing the systems on the sample questions that had been available for several months for training. The remaining trials implemented the test and retest with scenario modification strategy discussed earlier. The first batch of test questions, TQA, was repeated four days later as a retest; it was designated TQA' for scoring purposes. The difference in scores between TQA and TQA' represents improvements in the systems. After solving the questions in TQA', the systems tackled a new set, TQB, designed to be close to TQA. The purpose of TQB was to check whether the improvements to the systems generalized to new questions. After a short break, a modification was introduced into the crisis-management scenario, and new fragments of knowledge about the scenario were released. Then, the cycle repeated: A new batch of questions, TQC, tested how well the systems coped with the scenario modification; then after four days, the systems were retested on the same questions, TQC', and on the same day, a final batch, TQD, was released and answered.

Each question in a trial was scored according to several criteria, some official and others optional. The four official criteria were (1) the correctness of the answer, (2) the quality of the explanation of the answer, (3) the completeness and quality of cited sources, and (4) the quality of the representation of the question. The optional criteria included lay intelligibility of explanations, novelty of assumptions, quality of the presentation of the explanation, the automatic production by the system of a representation of the question, source novelty, and reconciliation of multiple sources. Each question could garner a score between 0 and 3 on each criterion, and the criteria were themselves weighted. Some questions had multiple parts, and the number of parts was a further weighting criterion. In retrospect, it might have been clearer to assign each question a percentage of the points available, thus standardizing all scores, but in the data that follow, scores are on an open-ended scale. Subject-matter experts were assisted with scoring the quality of knowledge representations when necessary.
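The arithmetic behind such a scheme is a weighted sum. The sketch below is purely hypothetical: the criterion weights and the treatment of multipart questions are invented placeholders, since the article does not give the actual values the evaluators used.

# Illustrative scoring arithmetic for one question. Weights and the
# part-count weighting are invented placeholders, not the actual values.

CRITERION_WEIGHTS = {
    "correctness": 3.0,
    "explanation": 2.0,
    "sources": 1.0,
    "representation": 1.0,
}

def question_score(component_scores, num_parts=1):
    """component_scores: criterion -> score in {0, 1, 2, 3}."""
    weighted = sum(CRITERION_WEIGHTS[c] * s for c, s in component_scores.items())
    return weighted * num_parts   # multipart questions count for more

print(question_score({"correctness": 3, "explanation": 2,
                      "sources": 3, "representation": 2}, num_parts=2))  # 36.0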

A web-based form was developed for scoring, with clear instructions on how to assign scores. For example, on the correct-answer criterion, the subject-matter expert was instructed to award zero points if "no top-level answer is provided and you cannot infer an intended answer"; one point for "a wrong answer without any convincing arguments, or most required answer elements"; two points for "a partially correct answer"; and three points for "a correct answer addressing most required elements."

When you consider the difficulty of the task,

both systems performed remarkably well. Scores on the sample questions were relatively high, which is not surprising because these questions had been available for training for several months (figure 5). It is also not surprising that scores on the first batch of test questions (TQA) were not high. It is gratifying, however, to see how scores improve steadily between test and retest (TQA and TQA', TQC and TQC') and that these gains are general: The scores on TQA' and TQB, and on TQC' and TQD, were similar.

The scores designated auto in figure 5 refer to questions that were translated automatically from English into a formal representation. The Teknowledge system translated all questions automatically, the SAIC system very few. Initially, the Teknowledge team did not manipulate the resulting representations, but in later batches, they permitted themselves minor modifications. The effects of these can be seen in the differences between TQB and TQB-Auto, TQC and TQC-Auto, and TQD and TQD-Auto.

Although the scores of the Teknowledge and SAIC systems appear close in figure 5, differences between the systems appear in other views of the data. Figure 6 shows the performance of the systems on all official questions plus a few optional questions. Although these extra questions widen the gap between the systems, the real effect comes from adding optional components to the scores. Here,


Teknowledge got credit for its automatic translation of questions and also for the lay intelligibility of its explanations, the novelty of its assumptions, and other criteria.

Recall that each question is given a score between zero and three on each of four official criteria. Figure 7 represents the distribution of the scores 0, 1, 2, and 3 for the correctness criterion over all questions for each batch of questions and each system. A score of 0 generally means the question was not answered. The Teknowledge system answered more questions than the SAIC system, and it had a larger proportion of perfect scores.

Interestingly, if one looks at the number of perfect scores as a fraction of the number of questions answered, the systems are not so different. One might argue that this comparison is invalid, that one's grade on a test depends on all the test items, not just those one answers. However, suppose one is comparing students who have very different backgrounds and preparation; then, it can be informative to compare them on both the number and quality of their answers. The Teknowledge-SAIC comparison is analogous to a comparison of students with very different backgrounds. Teknowledge used CYC, a huge, mature knowledge base, whereas SAIC used CYC's upper-level ontology and a variety of disparate knowledge bases. The systems were not at the same level of preparation at the time of the experiment, so it is informative to ask how each system performed on the questions it answered. On the sample questions, Teknowledge got perfect scores on nearly 80 percent of the questions it answered, and SAIC got perfect scores on nearly 70 percent of the questions it answered. However, on TQA, TQA', and TQB, the systems are very similar, getting perfect scores on roughly 60 percent, 70 percent, and 60 percent of the questions they answered. The gap widens to a roughly 10-percent spread in Teknowledge's favor on TQC, TQC', and TQD.

Similarly, when one averages the scores of answered questions in each trial, the averages for the Teknowledge and SAIC systems are similar on all but the sample-question trial. Even so, the Teknowledge system had both higher coverage (questions answered) and higher correctness on all trials.

If one looks at the minimum of the official component scores for each question (answer correctness, explanation adequacy, source adequacy, and representation adequacy), the difference between the SAIC and Teknowledge systems is quite pronounced. Calculating the minimum component score is like finding something to complain about in each answer: the answer or the explanation, the sources or the question representation. It is the statistic of most interest to a potential user who would dismiss a system for failure on any of these criteria. Figure 8 shows that neither system did very well on this stringent criterion, but the Teknowledge system got a perfect or good minimum score (3 or 2) roughly twice as often as the SAIC system in each trial.


[Figure 5: bar chart of average scores (official questions and official criteria only) for the TFS and SAIC systems on the batches SQ', TQA, TQA-Auto, TQA', TQB, TQB-Auto, TQC, TQC-Auto, TQC', TQD, and TQD-Auto; the y-axis is average score, 0 to 140.]

Figure 5. Scores on the Seven Batches of Questions That Made Up the Crisis-Management Challenge Problem Evaluation. Some questions were automatically translated into a formal representation; the scores for these are denoted with the suffix Auto.


    Movement Analysis

The task for movement analysis was to identify sites (command posts, artillery sites, battle positions, and so on) given data about vehicle movements and some intelligence sources. Another component of the problem was to identify convoys. The data were generated by simulation for an area of approximately 10,000 square kilometers over a period of approximately 4 simulated days. The scenario involved 194 military units arranged into 5 brigades, 1848 military vehicles, and 7666 civilian vehicles. Some units were collocated at sites; others stood alone. There were 726 convoys and nearly 1.8 million distinct reports of vehicle movement.

Four groups developed systems for all or part of the movement-analysis problem, and SAIC and Teknowledge provided integration support. The groups were Stanford's SMI, SRI, UMass, and MIT. Each site identification was scored for its accuracy, and recall and precision scores were maintained for each site. Suppose an identification asserts at time t that a battalion command post exists at a location (x, y). To score this identification, we find all sites within a fixed radius of (x, y). Some site types are very similar to others. For example, all command posts have similar characteristics; so, if one mistakes a battalion command post for, say, a division command post, then one should get partial credit for the identification. The identification is incorrect but not as incorrect as, say, mistaking a command post for an artillery site. Entity error ranges from 0 for completely correct identifications to 1 for hopelessly wrong identifications. Fractional entity errors provide partial credit for near-miss identifications. If one of the sites within the radius of an identification matches the identification (for example, a battalion command post), then the identification score is 0, but if none matches, then the score is the average entity error for the identification matched against all the sites in the radius. If no site exists within the radius of an identification, then the identification is a false positive.

Recall and precision rates for sites are defined in terms of entity error. Let H be the number of sites identified with zero entity error, M be the number of sites identified with entity error less than one (near misses), and R be the number of sites identified with maximum entity error. Let T be the total number of sites, N be the total number of identifications, and F be the number of false positives. The following statistics describe the performance of the movement-analysis systems:

zero-entity error recall = H / T
non-one-entity error recall = (H + M) / T
maximum-entity error recall = (H + M + R) / T
zero-entity error precision = H / N
non-one-entity error precision = (H + M) / N
maximum-entity error precision = (H + M + R) / N
false-positive rate = F / N
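Given per-identification entity errors, these statistics are mechanical to compute. The sketch below assumes each identification carries its entity error in [0, 1], with None marking a false positive, and it treats each identification as matched to a distinct site, glossing over the matching step described above.

# Compute the movement-analysis site statistics from scored
# identifications. Assumes at least one identification was made.

def site_statistics(entity_errors, total_sites):
    """entity_errors: list of float in [0, 1] or None (false positive)."""
    N = len(entity_errors)
    F = sum(1 for e in entity_errors if e is None)
    H = sum(1 for e in entity_errors if e == 0.0)
    M = sum(1 for e in entity_errors if e is not None and 0.0 < e < 1.0)
    R = sum(1 for e in entity_errors if e == 1.0)
    T = total_sites
    return {
        "zero-EE recall": H / T,
        "non-one-EE recall": (H + M) / T,
        "max-EE recall": (H + M + R) / T,
        "zero-EE precision": H / N,
        "non-one-EE precision": (H + M) / N,
        "max-EE precision": (H + M + R) / N,
        "false-positive rate": F / N,
    }

print(site_statistics([0.0, 0.3, 1.0, None], total_sites=10))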


[Figure 6: bar chart of average scores (all questions and all criteria) for the TFS and SAIC systems on the same batches as figure 5; the y-axis is average score, 0 to 180.]

Figure 6. Scores on All Questions (Including Optional Ones) and Both Official and Optional Criteria.


[Figure 7: stacked horizontal bars, one per system and batch (TFS and SAIC on SQp, TQA, TQA', TQB, TQC, TQC', and TQD), showing the distribution of correctness scores 0 through 3 as percentages.]

Figure 7. The Distribution of Correctness Scores for the Crisis-Management Challenge Problem. Correctness is one of the official scoring criteria and takes values in {0, 1, 2, 3}, with 3 being the highest score. This graph shows the proportion of the correctness scores that are 0, 1, 2, and 3. The lower bar, for example, shows that roughly 70 percent of the correctness scores were 3, and roughly 10 percent were 1 or 0.


[Figure 8: stacked horizontal bars, one per system and batch, showing the distribution of the minimum component score (0 through 3) over questions as percentages.]

Figure 8. The Distribution of Minimum Scores for the Crisis-Management Challenge Problem. Each question is awarded four component scores corresponding to the four official criteria. This graph shows the distribution of the minimum of these four components over questions. With the lowest bar, for example, on 20 percent of the questions, the minimum of the four component scores was 3.


Recall and precision scores for the groups are shown in figure 9. The experiment design involved two phases with a test and retest within each phase. Additionally, the data sets for each test included either military traffic only or military plus civilian traffic. Had each group run its system in each experimental condition, there would have been 4 tests, 2 data sets with 2 versions of each (military only and military plus civilian), and 4 groups, or 64 conditions to score. All these conditions were not run: Delays were introduced by the process of reformatting data to make it compatible with a scoring program, so scores were not released in time for groups to modify their systems; one group devoted much time to scoring and did not participate in all trials; and another group participated in no trials for reasons discussed at the end of this section.

The trials that were run are reported in figure 9. Recall rates range from 40 percent to just over 60 percent, but these are for maximum-entity error recall, the frequency of detecting a site when one exists but not identifying it correctly. Rates for correct and near-miss identification are lower, ranging from 10 percent to 15 percent, and are not shown in figure 9. Of the participating research groups, MIT did not attempt to identify the types of site, only whether sites are present; so, their entity error is always maximum. The other groups did not attempt to identify all kinds of site; for example, UMass tried to identify only battle positions, command posts, and assembly-staging areas. Even so, each group was scored against all site types. Obviously, the scores would be somewhat higher had the groups been scored against only the kinds of site they intended to identify (for example, recall rates for UMass ranged from 20 percent to 39 percent for the sites they tried to identify).

Precision rates are reported only for the SMI and UMass teams because MIT did not try to identify the types of site. UMass's precision was highest on its first trial; a small modification to its system boosted recall but at a significant cost to precision. SMI's precision hovered around 20 percent on all trials.


[Figure 9: horizontal bars (0 to 80 percent) showing maximum-entity-error recall, zero-entity-error precision, and non-one-entity-error precision for the MIT, SMI, and UMass trials in phases A and B, on military-only and military-plus-civilian data.]

Figure 9. Scores for Detecting and Identifying Sites in the Movement-Analysis Problem. Trials labeled A are from the first phase of the experiment, before the scenario modification. Trials labeled B are from after the scenario modification. Mil and Mil + Civ refer to data sets with military traffic only and with both military and civilian traffic, respectively.


Scores were much higher for convoy detection. Although scoring convoy identifications posed some interesting technical challenges, the results were plain enough: SMI, UMass, and MIT detected 507 (70 percent), 616 (85 percent), and 465 (64 percent) of the 726 convoys, respectively. The differences between these figures do not indicate that one technology was superior to another because each group emphasized a slightly different aspect of the problem. For example, SMI didn't try to detect small convoys (fewer than four vehicles). In any case, the scores for convoy identifications are quite similar.

In sum, none of the systems identified site types accurately, although they all detected sites and convoys pretty well. The reason for these results is that sites can be detected by observing vehicle halts, just as convoys can be detected by observing clusters of vehicles. However, it is difficult to identify site types without more information. It would be worth studying the types of information that humans require for the task.

The experience of SRI is instructive: The SRI system used features of movement data to infer site types; unfortunately, the data did not seem to support the task. There were insufficient quantities of training data to train classifiers (fewer than 40 instances of sites), but more importantly, the data did not have sufficient underlying structure; it was too random. The SRI group tried a variety of classifier technologies, including identification trees, nearest neighbor, and C4.5, but classification accuracy remained in the low 30-percent range. This figure is comparable to those released by UMass on the performance of its system on three kinds of site. Even when the UMass recall figures are not diluted by sites they weren't looking for, they did not exceed 40 percent. SRI and UMass both performed extensive exploratory data analysis, looking for statistically significant relationships between features of movement data and site types, with little success, and the regularities they did find were not strongly predictive. The reason these teams found little statistical regularity is probably that the data were artificial. It is extraordinarily difficult to build simulators that generate data with the rich dependency structure of natural data.

Some simulated intelligence and radar information were released with the movement data. Although none of the teams reported finding it useful, we believe these kinds of data probably help human analysts. One assumes humans perform the task better than the systems reported here, but because no human ever solved these movement-analysis problems, one cannot tell whether the systems performed comparatively well or poorly. In any case, the systems took a valuable first step toward movement analysis, detecting most convoys and sites. Identifying the sites accurately will be a challenge for the future.

    rately will be a challenge for the future.

    Workarounds

The workaround problem was evaluated in much the same way as the other problems: The evaluation period was divided into two phases, and within each phase, the sys