
Semantic Interoperability Community of Practice (SICoP) Best Practices Committee
Federal Chief Information Officers Council
Google: SICoP, DRM 3.0 and Web 3.0

Operationalizing the Semantic Web/Semantic Technologies: A roadmap for agencies on how they can take advantage of semantic technologies and begin to develop Semantic Web implementations

Advanced Intelligence Community R&D Meets the Semantic Web (ARDA AQUAINT Program)

White Paper Series, Module 3
June 18, 2006, Version 1.0
July 27, 2007, Version 1.1 (See Addendum)

Executive Editors and Co-Chairs:
Brand Niemann, U.S. EPA, and SICoP Co-Chair
Mills Davis, Project10X, and SICoP Co-Chair

Principal Author:
Lucian Russell, Expert Reasoning & Decisions LLC
[email protected]

Contributors:
Bryan Aucoin
[email protected]


SICoP Meeting - Morning Session

1.0 Background

At the request of Dr. Lucian Russell, the Semantic Interoperability Community of Practice (SICoP) organized a special meeting on February 6th, 2007 to consider the issue of "Building DRM 3.0 and Web 3.0 for Managing Context Across Multiple Documents and Organizations." The reason for this workshop was to explore further the Data Reference Model (DRM) specific implications of the IKRIS presentation at the October 16th 2006 SICoP workshop: http://colab.cim3.net/file/work/SICoP/2006-10-10/Presentations/CWelty10102006.ppt. The acronym stands for Interoperable Knowledge Representation for Intelligence Support. The presentation was given by Co-Principal Investigator Dr. Chris Welty: http://domino.research.ibm.com/comm/research_people.nsf/pages/welty.index.html

The presentation describes the IKRIS project, one of a number of unclassified projects funded over the last several years by the Advanced Research and Development Activity (ARDA) of the Intelligence Community. ARDA programs are now within the Disruptive Technology Office (DTO) of the Office of the Director of National Intelligence (DNI). The IKRIS project developed the IKRIS Knowledge Language (IKL). This language can be used to translate among a number of different powerful knowledge representations. It encompasses First Order Predicate Calculus (ISO's Common Logic) and has the necessary extensions to include non-monotonic logic; it also admits some Second Order Predicate Calculus expressions.

Given that this higher level interoperable representation of knowledge was announced on April 19th, 2006, its capabilities were unknown to the writing team that produced the Data Reference Model (DRM) Version 2.0. The latter document was built upon an understanding of Computer Science that can, at best, be described as a 2004 baseline. The DRM Version 2.0 discusses the topics of Data Description, Data Context and Data Exchange (Chapters 3, 4 and 5), called standardization areas. It provides an Abstract Model (Chapter 2) that contains abstract Entities and the Relationships among them, but shows that such Entities are distinctly allocated to the standardization areas.

The IKRIS Knowledge Language (IKL), however, creates the ability to specify DRM Entities and Relationships in a new, more powerful way, one that favorably changes the cost/benefit ratio of information sharing by orders of magnitude. This will lead to a new Abstract Model, but its details are as yet unknown. This SICoP Workshop's goal is to initiate the process of identifying the elements of that new Abstract Model that will in turn lead to a DRM Version 3.0.

The SICoP February 6th morning session was organized as a Special Conference to explore the implications of the existence of IKRIS. It was special because it brought together two members of the writing team of the DRM 2.0, the manager of a key government program whose artifacts were the basis for much of the substance of the DRM 2.0 guidance sections, and representatives of three of the world's outstanding research organizations:


- Dr. Christiane Fellbaum, Princeton University (WordNet)
- Dr. John Prange, Language Computer Corporation
- Dr. Michael Witbrock, Cycorp

In the session they discussed their work, which collectively opens up a new way of envisioning Artifacts and Services for Data Sharing.

2.0 The Data Reference Model

If it seems that the idea of redefining the Abstract Model of the DRM is "extreme", the history below will show that the DRM 2.0 is already a change from DRM 1.0. Hence there is precedent.

2.1 Public History

The writing team for the Data Reference Model Version 1.0 completed its work in December 2003 and the document was released in September 2004. It contained the notion of the three Standardization Areas, shown in Figure 1:

- Data Sharing
- Data Description
- Data Context

Together they defined the categories that would contain the artifacts and services to enable information sharing.

In the DRM 1.0 (http://www.whitehouse.gov/omb/egov/documents/fea-drm1.PDF) the following description (DRM 1.0, Page 4) of the three areas was provided:

"Categorization of Data: The DRM establishes an approach to the categorization of data through the use of a concept called Business Context. The business context represents the general business purpose of the data. The business context uses the FEA Business Reference Model (BRM) as its categorization taxonomy.

3

Figure 1: DRM Version 1.0 Three Part Structure (DRM 2.0 Below)

Page 4: Semantic Interoperability Community of Practice … · Web viewSemantic Interoperability Community of Practice (SICoP) Best Practices Committee Federal Chief Information Officers

Exchange of Data: The exchange of data is enabled by the DRM's standard message structure, called the Information Exchange Package. The information exchange package represents an actual set of data that is requested or produced from one unit of work to another. The information exchange package makes use of the DRM's ability to both categorize and structure data.Structure of Data: To provide a logical approach to the structure of data, the DRM uses a concept called the Data Element. The data element represents information about a particular thing, and can be presented in many formats. The data element is aligned with the business context, so that users of an agency's data understand the data's purpose and context. The data element is adapted from the ISO/IEC 11179 standard."

A project to complete the DRM Version 2.0 was initiated in 2005 and the resulting evolution of understanding is shown in Figure 2. In December 2005 the DRM 2.0 was released: http://www.whitehouse.gov/omb/egov/documents/DRM_2_0_Final.pdf. In this version the description had evolved to the point that the concepts were renamed Standardization Areas and were re-characterized. Also, it featured Communities of Interest (COI). The text states:

"Data Context is a standardization area within the DRM. A COI should agree on the context of the data needed to meet its shared mission business needs. A COI should be able to answer basic questions about the Data Assets that it manages. "What are the data (subject areas) that the COI needs? What organization(s) is responsible for maintaining the data? What is the linkage to the FEA Business Reference Model (BRM)? What services are available to access the data? What database(s) is used to store the data?" Data Context provides the basis for data governance with the COI.

Data Description is a standardization area within the DRM. A COI should agree on meaning and the structure of the data that it needs in order to effectively use the data.

Data Sharing is a standardization area within the DRM. A COI should have common capabilities to enable information to be accessed and exchanged. Hence the DRM provides guidance for the types of services that should be provisioned within a COI to enable this information sharing." (DRM Version 2.0 Page 6)

There are differences. The Information Exchange Package survived as the Exchange Package; Data Content and Business Context were replaced by the more general Data Context and Data Description. The new DRM entirely lacks the notion of the Data Element.

Figure 2: DRM 2.0 Three-Part Structure

2.2 Key Decisions of the DRM 2.0 Final Writing Team

The text of the DRM Version 2.0 was on the one hand the product of a large collaborative effort over several months and on the other a collaboration of a small number of specialists for two weeks. The work of the former was documented in the DRM Wiki that tracked the evolution of the standard. The work of the latter has hitherto not been made public. The team did not change the Abstract Model, nor the excellent work on examples (Sections 3.6, 4.6 and 5.6) provided by the Department of the Interior. Some information was consolidated, some was removed to Appendices, some was posted for reference on the Wiki, and some descriptive information was re-written. As the final product may have deviated from some people's expectations, the reasoning underlying the key decisions is provided below.

2.2.1 Audience for the DRM 2.0

In a conference call prior to the assembling of the final writing team the following principle was enunciated by Dr. Russell: to be a Federal Enterprise Architecture (FEA) standard, the Data Reference Model had to be both a reference and a model.

1. For a document to be a reference it must, by means of a compare-and-contrast process, be usable to render a judgment as to whether a data artifact or a process to enable sharing is or is not in compliance with the standards in the document.

2. For the DRM's logical artifacts and their relationships to be a model, they had to be an abstraction that was (a) simpler than their implementations but (b) common to all of them.

Once the team was assembled, the first decision that had to be made was "What is the audience for the DRM?" If the document was to describe a reference and a model, the audience had to be one that was interested in such a document and could use it. Hence the DRM Version 2.0, in Chapter 2, Section 2.1 (Target Audience and Stakeholders), states: "The target audience for DRM 2.0 is: Enterprise architects; Data architects."

That decision meant that management issues had to be addressed separately. This was the focus of a companion volume, the Data Reference Model Management Guide. The final editing for this volume was completed in the winter of 2006, but the volume has not yet been released by the OMB.

The fact that the DRM 2.0 was aimed at a technical audience also explains the prominence of the Communities of Interest. If one looks at the Data Description segment of the Abstract Model, Figure 2-5 in Chapter 2, Section 2.3, it might appear that it was designed to describe Relational Databases on the one hand and files on the other, and that the Data Context segment would merely define terms for a keyword search. Such an interpretation would be too limiting. The model is meant for all types of data, as is stated in Chapter 3 (Data Description), Section 3.3, Guidance:

"… the government's data holdings encompass textual material, fixed field databases, web page repositories, multimedia files, scientific data, geospatial data, simulation data, manufactured product data and data in other more specialized formats. Whatever the type of data, however, COIs specializing in them have developed within the government and external stakeholder organizations." (DRM 2.0 Page 20)

At the Feb 6th workshop the Global Change Master Directory project at NASA described a site with a nearly two-decade record of supporting successful data sharing among stakeholders for nearly 20 petabytes of data! The many agencies and their specialists who use this site obviously know what they are doing. A similar community is the Geospatial Community of Practice: http://colab.cim3.net/cgi-bin/wiki.pl?GeoSpatialCommunityofPractice. The Introduction and Guidance sections in Chapters 3, 4 and 5 of the DRM Version 2.0 were written to ensure that successful data sharing practices within the government would be allowed to continue.


2.2.2 The Primacy of Data Sharing

The writing committee accepted the principle that, whatever was written about the artifacts mentioned in Data Description and Data Context, the role of such artifacts and their instantiations was to support a variety of services needed for effective Data Sharing. The details of how this was to be accomplished for any specific data collection were then left to the individual COIs. The services would be defined as needed by the COI, and the Data Description and Data Context artifacts would be generated so as to meet the needs of those services.

2.2.3 The Technical Approach

The decision was made not to change the abstract model during November's writing session, but the team recommended that it be reviewed later by a panel of experts in the relevant Computer Science disciplines. Where there were reservations about the adequacy of the model, the team decided to address relevant issues by making changes in the descriptive wording of two Chapter Sections, Introduction and Guidance.

With respect to the Data Exchange section the key author, Bryan Aucoin, made the assertion that the concept of a document was sufficiently broad in his view to allow a complex inner structure to be present. This was a very important decision. In reality there is no such thing as a file containing data that has no structure. However, the term "Unstructured Data Resource" was part of the Abstract Model in the Data Description Chapter. The team agreed that Bryan's document could be mapped to either a Semi-Structured Data Resource or an Unstructured Data Resource. The distinction made in the DRM was that a Structured Data Resource was one whose structure was static and hence could be "factored out" into a Data Schema. This approach allows one to categorize as "unstructured" all data files whose structure either (a) depends on the context provided by the file's data, or (b) is known only to the application programs that are used to process them. Examples of type (a) files are the groups of files containing nuclear power plant safety analysis simulation data, e.g. NASTRAN input and output files. Examples of type (b) files are the various types of image files, identified by their unique suffixes; these are intended to be used by application programs that recognize image data, e.g. ".jpg" files.

The wording in the Data Context section was also carefully constructed to allow the word Topic room for interpretation. Although it looks similar to a simplistic "keyword", provision was made to allow it to be represented by a more complex artifact. This is because of a major ARDA/DTO funded effort to improve the lexical database WordNet to help with the automatic identification of meanings expressed by polysemous words in context.

3.0 Deficiencies in the DRM 2.0

The DRM 2.0 reflected a gap between the needs of organizations that wished to share data and the availability of mechanisms to meet those needs. This is because the Basic and Applied Research areas of the underlying Computer Science disciplines were at the time deficient in results that would allow those needs to be addressed. Lacking a firm conceptual basis for mechanisms that would meet those needs, they were left unmet. The COIs' data specialists, however, were well aware of the defects within their domains, and hence were expected to deal with them as best they could. The defects are primarily in three areas:


3.1 Defects in the Abstract Model and its Data Descriptions

The Abstract Model is described as if it were a Structured Data Resource. It is described using the well known concepts Entity, Relationship, Attribute and Data Type. This set of concepts is some 30 years old. Less well known, however, is that because they were initially developed to help document computer systems' databases they have been found to be of limited value in supporting data sharing. Specifically, they do not help one to address three issues: (1) Large Data Collections, (2) Schema Mismatch and (3) the lack of linguistic precision with respect to Topics.

3.1.1 Massive Data Collections

A technical solution that works for data collections that are gigabytes in size does not necessarily scale to data collections that are a million times larger. This is the size of the government's data collection; it is measured in petabytes (10^15 bytes). To access this amount of data entails developing a set of linked directories, each with its own set of abstract topics and embedded index pointers to other directories. Fortunately there is a template for managing such large collections of data: the Global Change Master Directory project, hosted by NASA and described in Section 6 of this report. It has a network of Data Assets, Data Resources and Topics that have allowed it to address the data sharing needs of multiple government agencies and public users. The DRM wording allows it to be constructed, but as there are no explicit DRM concepts that reflect the index structure one has to infer the solution. This report brings attention to the GCMD as a template for data sharing. It is a highly significant best practice.
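As an illustration only of the linked-directory structure being described (the names and topics below are invented, not the actual GCMD layout), here is a sketch in Python of directories that each carry their own topics plus index pointers to other directories:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Directory:
    name: str
    topics: List[str]
    links: List["Directory"] = field(default_factory=list)  # embedded index pointers to other directories

# Hypothetical linked directories in the spirit of a master directory.
oceans = Directory("Ocean Data", ["sea surface temperature", "salinity"])
impacts = Directory("Climate Impacts", ["drought", "crop yield"])
master = Directory("Master Directory", ["earth science"], links=[oceans, impacts])

def find_holders(directory, topic, seen=None):
    """Follow index pointers and return the names of directories that list the topic."""
    seen = seen if seen is not None else set()
    if directory.name in seen:
        return []
    seen.add(directory.name)
    hits = [directory.name] if topic in directory.topics else []
    for linked in directory.links:
        hits.extend(find_holders(linked, topic, seen))
    return hits

print(find_holders(master, "salinity"))   # -> ['Ocean Data']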

3.1.2 Schema Mismatch

The implicit idea in Chapter 3 is that the Data Description, notably the Data Schema, should be an information rich artifact that allows agencies to provision services (described in Chapter 5) which (a) discover and (b) enable access to identically structured data. It is hoped further that a broader spectrum of data can be made accessible. Section 3.2.1, "What is Data Description and Why is it Important", looks hopefully towards the future (DRM Version 2.0, Page 19):

Semantic Interoperability [1]: Implementing information sharing infrastructures between discrete content owners (even with using service-oriented architectures or business process modeling approaches) still has to contend with problems with different contexts and their associated meanings. Semantic interoperability is a capability that enables enhanced automated discovery and usage of data due to the enhanced meaning (semantics) that are provided for data…

[1] From Adaptive Information, by Jeffery T. Pollock and Ralph Hodgson, John Wiley and Sons, Inc., ISBN 0-471-48854-2, 2004, p. 6.

To overcome problems of such interoperability for Structured Data Resources, however, one must confront the decades-old problem of Schema Mismatch. This is the condition that causes two Structured Data Resources to contain the same data but have it represented by two different incompatible schemas.

How difficult a problem is it to overcome Schema Mismatch? There are three sub-problems:

1. Syntactic Mismatch
2. Entity-Attribute Mismatch
3. Attribute-Value Mismatch

3.1.2.1 Syntactic Mismatch

Syntactic mismatch is often thought of as contrasting date formats for attributes, or character strings vs. integers in a field like "Social Security Number" (i.e. dashes between number groups). There is a very complete description of all of these in the article by Won Kim and Jungyun Seo, "Classifying Schematic and Data Heterogeneity in Multidatabase Systems" (IEEE Computer 24(12): 12-18, 1991). The list is extensive, but the approach of using an XML database as an intermediate repository could be useful in this case. It also applies to data value representations, e.g. "1" vs. ".9999998", which would have to be handled by special rules.
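A minimal sketch, in Python, of the kind of normalization step that an intermediate repository or special rules would perform; the field names and formats are hypothetical examples, not DRM artifacts:

import re
from datetime import datetime

def normalize_ssn(value):
    """Strip dashes and spaces so '123-45-6789' and '123456789' compare equal."""
    return re.sub(r"[\s-]", "", value)

def normalize_date(value):
    """Map a few common date formats onto a single ISO form (YYYY-MM-DD)."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError("Unrecognized date format: " + repr(value))

# Two hypothetical agency records holding the same facts in different syntax.
record_a = {"ssn": "123-45-6789", "hired": "07/27/2007"}
record_b = {"ssn": "123456789", "hired": "2007-07-27"}

assert normalize_ssn(record_a["ssn"]) == normalize_ssn(record_b["ssn"])
assert normalize_date(record_a["hired"]) == normalize_date(record_b["hired"])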

3.1.2.2 Entity-Attribute Mismatch

Entity-Attribute Mismatch is a matter of data modeling, a choice in the semantics of the domain being modeled. Computer science courses at the undergraduate and even Masters level have taught that in data modeling there is an underlying triple of Entity or Object, Attribute and Value. It has been known for over 20 years, however, that an Entity in one Structured Data Resource can be an Attribute in another. Therefore the schemas of the two Structured Data Resources would have a different number of Entities, and the Entities would have a different number of Attributes. Moreover, the names of each are often made as short as possible to make the human-computer interface less cumbersome when ad hoc queries are formulated.

3.1.2.3 Attribute-Value Mismatch

The above problem is quite manageable, however, compared to the next one: Attribute-Value interchangeability. In the paper "Language Features for Interoperability of Databases with Schematic Discrepancies" (Proceedings of the 1991 SIGMOD Conference, pp. 40-49), the authors Ravi Krishnamurthy, Witold Litwin and William Kent discuss for the first time: "A less addressed problem is that of schematic discrepancies … when one database's data (values) correspond to metadata (schema elements) in others …"

To quote from the paper further: "Example: Consider three stock databases. All contain the closing price for each day of each stock in the stock market. The schemata for the three databases are as follows:

database euter(2):
    relation r : {(date, stkCode, clsPrice), ...}

database chwab(2):
    relation r : {(date, stk1, stk2, ...), ...}

database ource(2):
    relation stk1 : {(date, clsPrice), ...},
    relation stk2 : {(date, clsPrice), ...}

The euter database consists of a single relation that has a tuple per day per stock with its closing price. The chwab database also has a single relation, but with one attribute per stock, and one tuple per day, where the value of the attribute is the closing price of the stock. The ource database has, in contrast, one relation per stock that has a tuple per day with its closing price (3). For now we consider that the stkCode values in euter are the names of the attributes and relations in the other databases (e.g., stk1, stk2). … These schematically disparate databases have similar purposes although they may deal with different stocks, dates, or closing prices…

(Footnotes)
(2) Any similarity of names of the databases (i.e., (R)euter, (S)chwab, and (S)ource) to popularly known names is purely coincidental.
(3) This ource schema may seem contrived to a database researcher but, lo and behold, this is a popular schema among stock market data vendors."

It is now clear that a service that addresses the interoperability of data by accessing data elements is needed for data sharing. The DRM Schema construct, however, neglects even to mention Data Values.
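To make the discrepancy concrete, the following sketch (invented sample values, not code from the paper) maps all three layouts onto a common (date, stock, price) form, which is the kind of reconciliation a data-value-aware service would have to perform:

# In-memory stand-ins for the euter, chwab and ource layouts (invented values).
euter_rows = [("2007-02-06", "stk1", 10.0), ("2007-02-06", "stk2", 20.0)]
chwab_rows = [{"date": "2007-02-06", "stk1": 10.0, "stk2": 20.0}]
ource_tables = {"stk1": [("2007-02-06", 10.0)], "stk2": [("2007-02-06", 20.0)]}

def from_euter(rows):
    # The stock code is an ordinary data value in each tuple.
    return [(d, code, price) for d, code, price in rows]

def from_chwab(rows):
    # The stock codes live in the schema as attribute names.
    return [(row["date"], code, price)
            for row in rows for code, price in row.items() if code != "date"]

def from_ource(tables):
    # The stock codes live in the schema as relation names.
    return [(d, code, price) for code, rows in tables.items() for d, price in rows]

assert sorted(from_euter(euter_rows)) == sorted(from_chwab(chwab_rows)) == sorted(from_ource(ource_tables))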

There was a suggestion during 2005 that agencies document their Data Schemas using XSCHEMA and transmit them to a central repository. Given the Computer Science results cited above, many additional steps would have to be taken to effectively support data sharing in this manner. That is why the Discovery Services approach was adopted in DRM Version 2.0. As for how to supply the information necessary to detect and perhaps remedy the mismatches, one should start by looking back at Data Dictionaries. There was considerable work done on Data Dictionaries some 20 years ago, but as the work producing them did not appear to be cost effective they were not widely implemented. The key would be to use the Document associated with the schema, which, being semi-structured, could contain the Dictionary as a Structured Data Resource as well as text. A project to re-visit this approach in the light of the technologies described here was initiated at George Mason University on June 11th by a student under the guidance of Prof. Edgar Sibley.

The Data Description section of the DRM could be changed in the next version to supply such information. Currently it would have to be supplied by a linked document, but as we shall see later such documents can be generated in a way that is extremely useful even at this time.

3.1.2.4 The Role of Relationships

This is really a sub-topic of schema mismatch. It addresses the fact that a "Relationship" in the Abstract Model is used primarily as a human-readable markup technique to help understand the model. It is not actionable, and schema mismatch problems are not ameliorated by using Relationships.

The Relationship idea was introduced in 1976 by Peter Chen in the ACM Transactions on Database Systems, Volume 1, Issue 1. It has been shown useful as a documentation tool for software engineers to relate the data to a specification of its exact nature. About 10 years later an article in ACM Computing Surveys that compared Semantic Data Models proposed relating it to the concept of Abstract Classes, but this is not necessarily viable as some relationships are proposed to have attributes (which Abstract Classes lack). The best attempt to understand relationships within the context of the relational model was the final work of E.F. Codd, "The Relational Model for Database Management: Version 2", which is now, unfortunately, only available in the used book market. Despite their unclear role in data models, however, Relationships persisted in the Unified Modeling Language (UML). Of course this too is a software or system specification language (http://www.omg.org), so the retention of the ideas is not surprising. Recently the Relationship has been used to name functional relationships between classes, but that would contradict earlier assertions that there could be Relationships among three Classes.

Whatever critique is provided today, however, in fairness one must say that putting Relationships into a Data Model was a good idea at the time. It was the first step in the process of finding ways to add semantics to Data Models, semantics that the Relational Model lacked. However, it is not a construct that would likely persist in a new Abstract Model.

3.1.2.5 Summary

A more powerful abstract model, one that overcomes the above-mentioned deficiencies of the current abstract model, would contain new concepts and Entities that could be used by services to overcome issues like Schema Mismatch and enable more data sharing.

3.1.3 Data Context: Topics, Taxonomies and Ontologies

The Data Context section shows Topics that are in Taxonomies. These are shown in the master model in Chapter 2, Section 2.3. This diagram of the Abstract Model also shows that a document, as Unstructured Data, can be a Digital Data Resource which can have a topic. It is not limited to having one topic, though each topic must be in a taxonomy.

This construct is limiting. The builders of the lexical database WordNet found that there are important classes of relationships among words and concepts that go beyond the "Is-A" relationship that characterizes the taxonomic relationship. One significant example is the "Part-Of" relationship. Although this relationship is often used in OWL-DL Ontologies (which model classes of information described by nouns), there is a new, significant use that can be made of it in networks of verbs. Processes are made up of other (sub) processes, which are "Part-Of" the higher process. Given that the IKRIS project has shown that all the process description formalisms can be mapped interoperably to one another using IKL, one should look to expand the types of structured formalisms to which Topics are related. In fact this is a critical advance that must be made when creating new Data Descriptions for Structured Data Bases. The processes that generate the data in those databases need to be formally described, and the data in the Structured Data Resources mapped to the initial, transient or end events that are the cause of their inclusion.

4.0 The Motivation for the use of "3"

The morning session mentioned the DRM 3.0, the Web 3.0 and SOAs. Presenting adequate material to justify such an introduction of terminology was a rather ambitious undertaking. For anything to be a "3.0", however, there must be promise that problems in a version "2.0" can be alleviated if not completely corrected by the next version. What allows the "3.0" designation to be reasonable is the expectation that the necessary technology can now be built, given the Computer Science advances due to the ARDA/DTO sponsored projects.

4.1 The DRM Version 3.0

No activity is underway to replace the Data Reference Model Version 2.0, but now in 2007 one can make significantly more progress toward reasoning about data resources than was considered even possible in 2005 when the Abstract Model was developed. Where the Abstract Model shows three standardization areas it is now possible to consider a reduction to two areas, where Intelligent Awareness encompasses Data Description and Data Context. The issues are discussed below.

First let us consider a justification for keeping a two-tier model rather than rolling up everything into an amorphous "knowledge model", as shown in the PowerPoint slide image below. Whatever Computer Science and enabling technology may be able to bring to bear on the problems of data sharing, there is just too much data to share. This leads to the necessity of a Query Oriented Architecture (QOA), one where automated services in a Service Oriented Architecture can dynamically assemble data to meet a query's needs. There is a reason that this term is unfamiliar and an even better one why it should now be considered.

Queries have been a staple of data management technology for a long time. Looking back to the foundations of data management, the key reason that the relational model of data was chosen was its ability to generate a View. The data in the database was organized into Base Relations, and then through an SQL statement the data could be culled, sorted, merged and joined into a very large number of new collections; each one so generated was a View. The generating SQL statement, however, was a Query, and as any query could be posed of relational data there was no need to single out Queries.
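As a reminder of the mechanics (a generic SQLite sketch with invented data, not a DRM artifact), a View is simply a named, stored Query over Base Relations:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE closing_price (trade_date TEXT, stk_code TEXT, cls_price REAL);
    INSERT INTO closing_price VALUES
        ('2007-02-06', 'stk1', 10.0),
        ('2007-02-06', 'stk2', 20.0);

    -- The View is nothing more than a stored Query over the Base Relation.
    CREATE VIEW high_closers AS
        SELECT trade_date, stk_code FROM closing_price WHERE cls_price > 15.0;
""")
print(conn.execute("SELECT * FROM high_closers").fetchall())   # -> [('2007-02-06', 'stk2')]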

Figure 3: The Data Reference Model 3.0, Web 3.0, and SOAs


The data that was modeled, however, was the data of business. Data about the real world was handled differently, and no general purpose query systems existed that encompassed such data. That is not to say there was no attempt to do so, however. In the mid 1980s a DARPA contract provided to the Computer Corporation of America led researchers to the Probe Data Model (PDM). This was described in a technical report and many academic articles, and the resulting papers described the many difficulties encountered in extending SQL to real world data. One major issue was the discontinuity between the domain of business forms, with their discrete data, and the Real World, where one encounters continuous data. The latter could only be measured at points in time, and hence sampled. The SQL "join" became the "Apply-Append" operator, and in being so generalized was no longer of interest: although it handles missing data, the data was still missing.

The Real World, however, remains of interest and the U.S. Government collects a considerable amount of data about it. Not every aspect of the real world of interest, however, can be pre-determined, and data that is stored may or may not reflect the interests of the person who believes it might be useful. To bridge the gap between users of data and collectors of data, a service needs to be available to reason about the relevance of data sources. This notion of relevance is well developed in the field of Information Retrieval that enables Search Engines, but is only intuitively understood in a more general sense. The QOA is a means of making some headway towards a formalization. It is one that addresses specifically the needs of assembling data from Structured Data Resources as well as other Data Resources that contain non-textual data. We must add textual data to make that other data understandable to a discovery service.

A QOA is one where a user can specify an interest in data to a system and the system will dynamically examine the data which it does have and identify its precise semantics. This description is then available as input to the querying person. It can then be re-used, either to create more detailed questions or to change the intended semantics so that a better answer is retrieved the next time. The first issue to consider is what a "query" really is, and to do so we look again at the most well known query language, SQL.

One of the lesser known aspects of Relational Database theory is that under a closed world assumption - the database is all the data - SQL queries are expressions in First Order Logic. This was proven in 1978-84 in a series of papers by Raymond Reiter. It was shown by later researchers to be the same in Object Oriented databases. So a query is a proof of some assertion that uses the world of data stored in a database to determine the assertion's truth or falsity. Given the limited context of an SQL database it is possible to find logical assertions about those Entities or Objects that are modeled. When multiple relational databases are involved, however, there is no uniform naming convention, so even identifying what data exists that would become part of a proof is problematic. If non-relational data is considered as well the problem appears to be insuperable. Fortunately there is a way to approach the problem. That way is to use language, in our case English, as the more expanded query language.
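For example (a generic query, not one drawn from the DRM), the SQL statement SELECT name FROM Employee WHERE dept = 'IT' corresponds, under the closed world assumption, to the first order formula whose answers form the set

{ x | ∃d ( Employee(x, d) ∧ d = 'IT' ) }

The database either contains tuples witnessing the formula for a given x, or, by the closed world assumption, the assertion is false for that x.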

In the year 2007 there are many computer scientists and practitioners who have no knowledge of the roots of database technology, specifically the early work of the people who tried to understand data modeling. When data modeling was introduced into software engineering, such as it was in the 1970s, the idea was to examine documents describing how a computer system was to behave and strip away everything that was not essential. People were taught to take a page of text and use a highlighter on every noun and ignore all the rest of the words. These nouns were then grouped into Entities and Attributes. As an afterthought the verbs were put next to various lines in a graph between the Entities and called Relationships. The Relationships were documentation devices and the names were not actionable.

In order to bring to light the actual meaning of data collections, however, the original text descriptions need to be restored, or regenerated; some new ones will have to be created. These will be parsed and rendered into a knowledge base artifact associated with the relevant Data Resource. The English language has been examined in great detail over the last 15 years. Instead of looking at the words in a sentence as a sequence of multi-meaning character strings, one can now look at the words in a sentence as a set of sense-disambiguated tokens. Moreover, as we can see in the February 6th presentations by LCC and Cycorp, it is now possible to take these same natural language texts and translate them into logical assertions. This means that the way to build the knowledge base needed for a Query Oriented Architecture is to let the query be formed in Natural Language (English).

The queries are the questions that have been investigated in the AQUAINT program. Systems developed in the AQUAINT program parse Natural Language queries and interact with knowledge collections. These can be described as a network of Data Context and Data Description artifacts. Services will reason about the query using the new knowledge bases, and these artifacts will allow the service to identify relevant data in the databases. This data can be dynamically accessed and the relevant query results assembled for presentation to the user.

In the DRM 2.0 the Data Description Chapter did not mention how language should or should not be used to support Data Sharing Services. It was then only in the Data Context Chapter that any Natural Language artifacts were named: topics organized in a taxonomy. As we mentioned above, that is only one type of organization of concepts in English. We need to use all of them and then go farther.

4.2 Service Oriented Architectures

There is a considerable amount being written about Service Oriented Architectures. Much of it describes how the services should be organized to support the business of an organization. This, however, is not new. In addition one reads about how XML and Web services should be used. This is good because it brings to light what the services are actually doing and allows a record of the messages passed among them to be stored in a form that is readable by people. However, when one digs deep into the "fine print" of Service Oriented Architectures one comes up against the Service Description. This is where we see that we are back to the same old low level computer technology: Application Programming Interfaces (APIs) and a text document that describes how to invoke them. The input and output data may be described with tags, but this does not mean that the content of the tags is understood in any way other than character-string matching. Fortunately, this does not have to be the case any longer.

The IKRIS project made a major advance in that it demonstrated that the various linguistic and logical formalisms used for process specifications are interoperable, e.g. NIST's Process Specification Language (PSL) and Cycorp's CYC-L. Process specifications are important because a service is a process: it takes in input and after a time it provides an output. A service will be implemented by multiple sub-processes, however. These may occur in series or in parallel. The sub-processes may be one of a set of alternatives, and any sub-process may occur one time, many times or not at all. A process specification is the correct way to describe a service (another of the formalisms shown to be equivalent was OWL-S, the computer services Ontology). The specification's process steps' input and output data can be cross-referenced to data in Structured Data Resources. Within the process one can also describe the logical or metric relationships among the data items, i.e. data integrity constraints.
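To make the idea concrete, here is a minimal sketch (invented names, not a rendering of PSL, CYC-L or OWL-S) of a service described as a process whose steps and sub-processes declare the data items they consume and produce:

from dataclasses import dataclass, field
from typing import List

@dataclass
class ProcessStep:
    name: str
    inputs: List[str]                      # data items consumed (cross-referenced to schema elements)
    outputs: List[str]                     # data items produced
    substeps: List["ProcessStep"] = field(default_factory=list)   # "Part-Of" sub-processes

# A hypothetical data-access service modeled as a process with two sub-processes.
service = ProcessStep(
    name="RetrieveClosingPrices",
    inputs=["stkCode", "dateRange"],
    outputs=["clsPrice"],
    substeps=[
        ProcessStep("ValidateRequest", ["stkCode", "dateRange"], ["validatedRequest"]),
        ProcessStep("QueryDataResource", ["validatedRequest"], ["clsPrice"]),
    ],
)

def data_items(step):
    """Collect every data item touched anywhere in the process tree."""
    items = set(step.inputs) | set(step.outputs)
    for sub in step.substeps:
        items |= data_items(sub)
    return items

print(sorted(data_items(service)))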

The only way that a process can be described in Natural Language, however, is to include a precise vocabulary of time, and use it to create time markups. This has been the objective of the TimeML language project within AQUAINT, one that has made significant progress towards that goal. The other aspect of processes is that the topic words or concepts that describe them are verbs. Without correctly understanding how to use verbs, the words that describe changes in either state or spatio-temporal position, one cannot specify processes and services. The TimeML technology allows text documents such as process descriptions to be read by linguistic tools and appropriate time markups inserted to make temporal relationships precise. These are available to reasoning services which can then detect semantic equivalents among different data descriptions.
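As a rough illustration of what such a markup looks like, here is a hand-written fragment in the spirit of TimeML (the sentence is invented and the tag attributes are simplified, not validated output of any tool):

# A hand-written fragment in the spirit of TimeML; the attributes are simplified.
annotated = """
The sensor <EVENT eid="e1" class="OCCURRENCE">failed</EVENT> before the team
<EVENT eid="e2" class="OCCURRENCE">recalibrated</EVENT> it on
<TIMEX3 tid="t1" type="DATE" value="2007-02-06">February 6, 2007</TIMEX3>.
<TLINK lid="l1" eventID="e1" relatedToEvent="e2" relType="BEFORE"/>
"""
# A reasoning service that parses such markup can conclude that event e1 precedes e2.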


4.3 The Web 3.0

The Web 3.0 is now being called "The Semantic Web". In January Wikipedia was refusing to let anyone create a definition of it, but now a number of discussion points have emerged. As it relates to this discussion, however, it would be reasonable to believe we have a Web 3.0 when there are Web services available that can take the content of the Web 2.0 and create order within the chaos. Web 2.0 artifacts are characterized as being generated by people, and full of cross references, bookmarks and recommendations. Blogs and videos and shared images help make the content more diverse and personal.

At the center of such a wealth of input is a core of facts and relationships about facts. This leads to a number of different contexts being established in which the facts are considered true, suspect, or false. On any matter where there is argumentation the alleged facts used in an argument can be linked back to Web resident sources. The sources' credibility can be evaluated using ancillary evidence and believed or not. Over time, those assertions that have merit emerge as a common Knowledge Base. Realizing this requires a Knowledge Base that is vast and variegated yet manageable. The decades-long work of Cycorp provides guidance as to how this knowledge should be organized.

5.0 Language and Semantic Interoperability

In recent years much has been written about the projected use of "Semantic Technologies" to enable more sharing of information using Web technology, but the grounding of such discussions on a firm computer science foundation has left much to be desired. To understand semantics, however, requires understanding language, and in our case the English language. Therefore, on February 6th, following two presentations that discussed the history and promise of the DRM, the session introduced Princeton Professor Christiane Fellbaum. She is the Principal Investigator and lead scientist of the WordNet Project. Her presentation was solicited to explain what we now know about English and to introduce to a wider audience the cross-annotated and disambiguated map of the English language that is now available.

5.1 WordNet

For a history of the project see http://wordnet.princeton.edu/ and purchase the book (ISBN 0-262-06197-X). To quote from the Website:

"WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser."

WordNet must rank high in the list of books that are cited but have not been read. As part of the AQUAINT program of linguistic technology enhancement, begun by ARDA in 2001, WordNet's initial collection of synsets and word relationships has moved from a "best efforts" database to a precise reference collection of word-concept mappings. Words are grouped into sets of roughly synonymous words, called synsets, each of which expresses a concept. English words are polysemous (have multiple meanings), and so appear in multiple synsets. Their meanings can be disambiguated through the other synset members, through the semantic relations among synsets, and through the definitions that accompany each synset. A recently completed major enhancement of WordNet was the manual disambiguation of the definitions, or glosses: the nouns, verbs, adjectives and adverbs in the glosses were linked to the appropriate synsets in WordNet. This not only further clarifies the meaning of each synset, but also creates a corpus of sentences (the definitions) that are semantically annotated against a large lexical database. The glosses were translated into logical form, and the variables in these forms are linked to specific WordNet senses. These data constitute a valuable resource for automated reasoning and inferencing.

For those who have not read the book, it turns out that there are multiple networks among words, which means that the project could accurately be renamed "Wordnets". These are discussed in the presentation stored on the Wiki, but the key points are listed below:

5.1.1 Nouns and their Relationships

What are nouns? Nouns name things, and in English things can be either abstract or concrete. Whether the one or the other, however, one must be able to distinguish an instance of one thing from another. This means that there will be a number of distinguishing properties for the thing that is named by a noun.

For nouns, there are several important relationships beyond the ones familiar to Library Science, i.e. synonyms, general terms and specific terms. The synonyms give rise to the conception of the "synsets" as one means of disambiguating polysemous words. For nouns the substitutions are very likely to cause little shift in meaning (in contrast to what happens with verbs). The more general terms are called hypernyms and the less general ones hyponyms. Linguistically this relationship shows up in the form "an A is a B".

WordNet, however, has shown that there is another use of "Is-A” common in English, the telic relationship where the prior phrase is an abbreviation of "is used as a". One can see the difference in the WordNet example of the two phrases "a chicken is a bird" and "a chicken is a food". The former is an instance of the hypernym/hyponym relationship between "bird" and "chicken" and the latter is a telic relationship. In addition to the above there is the "Is A Part-of" relationship, where more general words are holonyms and more specific words are meronyms. There is more, so, as Prof. Fellbaum said at the workshop, "read the book!"
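For readers who want to explore these relations directly, the WordNet database is distributed with the NLTK toolkit; a small sketch follows (sense numbers and the exact synsets returned vary with the WordNet release):

import nltk
nltk.download("wordnet", quiet=True)        # fetch the WordNet data on first use
from nltk.corpus import wordnet as wn

# "chicken" is polysemous: it appears in several noun synsets (food, bird, coward, ...).
for synset in wn.synsets("chicken", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

one_sense = wn.synsets("chicken", pos=wn.NOUN)[1]       # pick one sense; numbering varies by release
print("hypernyms:", one_sense.hypernyms())              # more general synsets ("an A is a B")
print("part meronyms:", one_sense.part_meronyms())      # parts of the thing
print("holonyms:", one_sense.member_holonyms())         # wholes it belongs to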

5.1.2 Verbs and their Relationships

What are verbs? Verbs name events or states that have temporal properties; often they label a change of state or a displacement in space; in either case the change occurs over time.

For verbs there are relationships that are similar to those of nouns, but they are not identical! For verbs the synsets and the hypernym/hyponym relationships are different. First of all, whereas for nouns replacing a word in a sentence with a synonym is unlikely to change the meaning of the sentence very much, the WordNet book says the same is not true for verbs: the shifts in meaning are less subtle.

The second important difference is that although a given verb synset can have a single hypernym, unlike with nouns there is not just one type of hyponym. Rather, the relationship of more specific words is one of entailment. WordNet identifies four types of entailment, distinguished by temporal properties. One of these is the troponym, which is a name for a verb that has the same starting and ending time as the verb that is its hypernym. The difference is that the troponym qualifies the manner in which the action takes place. If the hypernym is "to walk" then a troponym is "to skip".
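The same NLTK interface exposes the verb relations; again a sketch, with the exact synsets depending on the WordNet release:

from nltk.corpus import wordnet as wn

walk = wn.synset("walk.v.01")                   # the basic locomotion sense
print("hypernyms:", walk.hypernyms())           # the more general verb it specializes
print("troponyms:", walk.hyponyms())            # manner-specific verbs (stroll, march, ...)
print("entailments:", walk.entailments())       # actions that walking necessarily involves (stepping)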

Of particular significance to Semantic Interoperability, however, is the relationship between verbs that have the relationship of holonym/meronym. In this case the holonymic verb describes a process of which the meronymic verb is a sub-process. A process is what is behind a service, which means that creating an accurate Service Description requires an accurate process description.

5.1.3 Other WordNet Word Relationships

The relationships among words described above are not the only ones in WordNet. Under a continuation of the AQUAINT program (Phase III), WordNet is being expanded with other relationships, such as those that relate nouns to their verbal expressions. However, there is a point at which the number of relationships within the language must be capped, i.e. there needs to be a core set. This is because, although human language can express many, many relationships about the real world and the world of ideas, within them one should look for abstractions of linguistic structures. In choosing them some of the concepts that are used to describe the real world must be, and have been, included, e.g. seas. However, language that describes the specifics of the real world should specify the many relationships within the world using the core set of relationships. Thus inland seas and land-locked seas are described by phrases within a language that reflect the real-world situation. They are not part of the core terminology, whereas the word "sea" is a linguistic concept. This issue is addressed further in the discussion below.

5.1.4 WordNet as the Lexical Ontology

Although much has been said about "Ontologies", we now introduce a distinction with a difference. Ontologies can be described:

- using the W3C recommended Web Ontology Language OWL, specifically the Description Logic portion of it (OWL-DL), or
- using a free-floating sense in which anything that is specified can be part of the Ontology; the structure of the Ontology is then provided by First and Higher Order logics. This is the model Cycorp uses.

There are several debates in progress in the subject area of the form and content of Ontologies. One of them is what constitutes the "best upper Ontology". This question at the present time is really a philosophic one, but the idea is simple: how do we divide up everything known to humankind into a finite collection of categories. Although there may be new answers in the future, we can look at a sub-set of that ideal Ontology, the set of all things which are described by a word or compact noun phrase in English. Specifically we look at all of such words and phrases that are in WordNet. Because WordNet limits its contents largely to those concepts for which there is a word (a "lexeme"), WordNet is a Lexical Ontology for English.

Not all concepts, even the most useful ones, have expression as a word in English. The WordNet book gives two important examples of this mismatch. In the section on verbs it was stated that verbs describe a change of state or a motion in space, in both cases a change over time. There is not, however, a most general verb for "change state" or for "alter location". Instead we dynamically construct a noun-phrase for that concept as needed. These are two missing words that would, if they existed, be hypernyms for a large number of other verbs. Here, WordNet includes these concepts, even though they are not lexicalized in English.

Now we consider the other case, where English and Italian words for "wall" are compared. English has a general word for inner and outer walls. By contrast, Italian has distinct words, but no general one.

Can the lexical Ontology be "filled in"? Of course it can, as is illustrated above, but then we are into concept creation. Where does it stop? Language constructs are used to build descriptions of the real world, and not every conceivable construct should be in the Lexical Ontology. It is not an enumerated list of all things in the world. So just as there needs to be a limit to the number of relationships, there needs to be a limit on the number of words in the Lexical Ontology. WordNet can be extended to add more concepts, but this should be done in a controlled manner. When a Community of Interest wants to expand WordNet with its domain-specific vocabulary, the project is ready, willing and able (subject to resource constraints) to work with it to record the words of specific domains.

The advantage of taking things one step at a time cannot be overstated. Expanding Semantic Interoperability requires that documents that describe a collection of data do so in a precise manner. Using WordNet effectively will allow precise word meanings to be used in sentences, which will then make the documents accessible to the enhanced discovery services that are needed to support Semantic Interoperability.


5.2 Well Formed Lexical Ontologies

Based on the discussion of WordNet it should be clear that a verb is not a noun. Yet perhaps because of the flexibility of English this principle is overlooked when people try to create networks of words and corresponding collections of data that they call Ontologies. Some words are both nouns and verbs, though their synsets are very different. However, when the verb sense is required there is always the potential for "noun-ifying" a verb and sticking the word into an "Ontology". Another practice is to take parts of sentences, assigning nouns to nodes and connecting them with arcs that have phrases written above them. The result is called a "semantic net" or a "knowledge diagram" of something, but because these assignments are ad hoc the diagramming step adds no machine-processable knowledge. The markups are good for people to read, but unless they have a precise meaning within a formalism they cannot be used systematically by enhanced discovery services and thus do not aid Semantic Interoperability.

5.3 Suggestions on How It Should Be Done

Data that is to be made available for Data Sharing should be fully described in a Document that is a Data Description. In addition to any metadata describing the document creation, there should be a set of topics extracted that provide a Data Context.

In the case of Data Descriptions where there is a SCHEMA, this DOCUMENT should be in disambiguated Natural Language and contain an IKRIS-understandable description of all processes that generate the data. All data elements would be identified as to where in the process's steps they are used, created, changed and deleted.

Ontologies in the OWL-DL sense should be created or referenced for each data item as needed, but class names should only be nouns. Non-lexical terms should only be specified as a specialization of a lexical term and specific inclusion/exclusion rules should be provided.
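A minimal sketch of what such a noun-named, WordNet-anchored class might look like, using the rdflib library; the namespace, the WordNet sense URI and the inclusion rule text are hypothetical placeholders, not prescribed by the DRM:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.gov/drm/")                  # hypothetical agency namespace
WN_SEA = "http://wordnet.example.org/sea.n.01"             # placeholder for a WordNet sense URI

g = Graph()
g.bind("ex", EX)

# Lexical class: the class name is a noun and is anchored to a WordNet sense.
g.add((EX.Sea, RDF.type, OWL.Class))
g.add((EX.Sea, RDFS.label, Literal("sea")))
g.add((EX.Sea, RDFS.seeAlso, URIRef(WN_SEA)))

# Non-lexical term: specified only as a specialization of a lexical term,
# with an explicit (here, free-text) inclusion/exclusion rule.
g.add((EX.InlandSea, RDF.type, OWL.Class))
g.add((EX.InlandSea, RDFS.subClassOf, EX.Sea))
g.add((EX.InlandSea, RDFS.comment, Literal("Hypothetical rule: includes seas not connected to an ocean.")))

print(g.serialize(format="turtle"))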

Logical constraints in the form of rules that hold among data values would also be explicitly stated and be made available to a Constraint Discovery Service.

6.0 The Global Change Master Directory is the Model

The presentation material of the Global Change Master Directory (GCMD) is rather complete in describing how it is organized today. The detailed presentation is available on the Wiki page for the Feb 6th meeting. It is used by many government agencies to share data; in fact it is used to describe 18 petabytes of data. It surely has no peer as a successful template for data sharing and for building the network of DRM 2.0 artifacts needed for data sharing. Why it was overlooked until the last minute, however, is an interesting study in the complexity of Computer Science today.

The Master Directory is obscure in part because it is a great success. The project's goal is to provide a means of identifying the LARGE data collections devoted to a particular concept. In order to do this there is a need to understand what a concept is, how concepts should be organized, what categories of information should be used to collect descriptive data, and what inter-relationships among concepts should be supported. One reason that the GCMD works is that the people involved in its specification and design are highly educated and very motivated. It would be an incorrect assumption, however, to conclude that development of the directory has been either a straightforward activity or without controversy. The project started in 1988 and has evolved considerably since.

The GCMD was known to Dr. Russell when he joined the DRM 2.0 writing team, and had it been possible to do so it would have become part of the explicit Abstract Model. Instead, through appropriate language in the Introduction and Guidance sections of the Data Description and Data Context Chapters, it is there implicitly. This is important because it provides a coherent means of bringing together the different concepts needed by Discovery Services and the data store descriptions that will need to be processed to identify those stores that will be the subject of a Data Access Service.

The first area of success is in concept representation. The GCMD has to organize both data concerning the physical nature of the real world (measurements of natural phenomena) and data concerning the impact of the activity of human social institutions within the world. This challenge was met during the mid-1990s when, to support the merger of two major data-set collection indexes, two nearly orthogonal taxonomies had to be accommodated and inter-related. One represented the data collections that had been developed to record physical phenomena. The other was a collection of data describing the impact of global climate change on human activity.

The government is an institution that holds data about the physical world, the world of social behavior and the world of individual behaviors. The concept space so encompassed is vast; in addition to the words indexed in WordNet it extends to a large additional technical vocabulary. Effectively developing any Taxonomy artifact mentioned in the Data Context Chapter of the DRM will require considerable reduction of that vocabulary to a smaller number of concepts. The success of the GCMD in meeting this challenge is evident. It should therefore be a model for how any such effort should be undertaken. The DRM 2.0 shows the way.


Figure 3: Network of Topics and Directory Page Documents

In Section 4.2.2, "Purpose of the Data Context Section of the DRM Abstract Model", the way is shown in the text that follows:

"Context often takes the form of a set of terms, i.e. words or phrases, that are themselves organized in lists, hierarchies, or trees; they may be referred to as "context items". Collectively, Data Context can also be called "categorization" or "classification". In this case the groupings of the context items can be called "categorization schemes" or "classification schemes." More complex Data Context artifacts may also be generated, e.g. networks of classification terms or schemes.Classification schemes can include simple lists of terms (or terms and phrases) that are arranged using some form of relationship, such as- sets of equivalent terms,- a hierarchy or a tree relationship structure, or - a general network."

The intent of the above language was to open the door to the use of WordNet's exact synsets.

The second means of organization is in the pointing of the TOPIC artifact to the DATA ASSET. The latter is defined generally, but can be a DOCUMENT, and that document can have links in it that a DISCOVERY SERVICE can access. These links can be to other TOPICS, in one or more (lexical) Ontologies, or to other Documents that are pages in the Directory.
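A minimal sketch of this Topic-to-Directory-Page structure, in Python with hypothetical class and function names:

class Topic:
    def __init__(self, term):
        self.term = term              # ideally a disambiguated WordNet sense
        self.data_assets = []         # documents, databases, or directory pages

class DirectoryPageDocument:
    def __init__(self, title):
        self.title = title
        self.links = []               # links to Topics or to other DirectoryPageDocuments

def discover(start_topic, seen=None):
    """Naive discovery service: walk topic -> asset -> link edges and collect page titles."""
    seen = set() if seen is None else seen
    found = []
    for asset in start_topic.data_assets:
        if isinstance(asset, DirectoryPageDocument) and asset.title not in seen:
            seen.add(asset.title)
            found.append(asset.title)
            for link in asset.links:
                if isinstance(link, Topic):
                    found.extend(discover(link, seen))
    return found

climate = Topic("climate change")
page = DirectoryPageDocument("GCMD-style directory page")
climate.data_assets.append(page)
print(discover(climate))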


This is similar to the way that the Data Collections of the government are indexed in the GCMD. This is the template for the rest of the government. It should start at the top and continue all the way down to individual databases, document collections and file collections. It has worked before, and with the new tools available today the words in the topics and directory page documents can be specified unambiguously.

7.0 Language Computer Corporation's Parsing Suite

Language Computer Corporation (LCC) is a company that specializes in human language understanding Research and Development. It was founded 11 years ago in Dallas, Texas and established a second office in Columbia, MD in mid-2006. It employs about 70 research scientists and engineers. Its research funding comes primarily from DNI/DTO, NSF, AFRL, DARPA and several individual Government Agencies. It has working software products. Its technology has been transferred to individual Government Organizations, Defense contractors and, more recently, to Commercial Customers. It was a major participant in the AQUAINT project and has aggressively incorporated many of the advances reported by the nearly 20 research teams into its own product suite. The link to the Wiki page with the conference briefing is http://colab.cim3.net/cgi-bin/wiki.pl?SICoPSpecialConference_2007_02_06; the briefing was in Part 2 of the morning session. The lead-in for the presentation is:

"What can be done today to extract logical relationships from text sources?. LCC is there as the company that specializes in extracting logical representations from language, which can generate new knowledge from well crafted Data Descriptions."

The products described in the briefing fall into three categories:

Information Extraction
- CiceroLite and other Cicero Products
- Extracting Rich Knowledge from Text

Polaris: Semantic Parser
- XWN KB: Extended WordNet Knowledge Base
- Jaguar: Knowledge Extraction from Text
- Context and Events: Detection, Recognition & Extraction

Cogex: Reasoning and Inferencing over Extracted Knowledge
- Semantic Parsing & Logical Forms
- Lexical Chains & On-Demand Axioms
- Logic Prover

The description of the several products' capabilities was laid out to show that there is a current capability for understanding the content of documents that was heretofore thought to be a decade away. The products' success in NIST trials was also cited.

In achieving the impressive successes cited above, a very important capability had to be created: linguistic entailment caused by an event. The idea is simple: for the assertion that a person X performed act Y at time T to be true, many other assertions about prior acts may also have to be true. That means one verb at a given time entails the truth of many assertions stated using other verbs at a prior time. Specifically, this means that data describing an act Y can be taken as relevant to whether prior data described by other verbs is also relevant. This is done by understanding the meaning of words and their inter-relationships.
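A minimal sketch of such entailment expansion, assuming Python and NLTK's WordNet interface (the synsets shown are illustrative):

from nltk.corpus import wordnet as wn   # requires nltk and its WordNet corpus

# WordNet records some verb entailments directly on verb synsets,
# e.g. snoring entails sleeping.
print(wn.synset("snore.v.01").entailments())    # typically [Synset('sleep.v.01')]

# A discovery service could expand a query verb with its entailed verbs so that
# documents describing the prior acts are also judged relevant.
def expand_verb(lemma):
    expanded = set()
    for s in wn.synsets(lemma, pos=wn.VERB):
        expanded.update(e.name() for e in s.entailments())
    return expanded

print(expand_verb("snore"))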

LCC was selected to participate in the meeting because of its linguistic focus: in addition to identifying the exact knowledge that is within text, its tools can also extract assertions which may not have a factual basis, e.g. opinions. The theory of language shows that statements in human language do more than just communicate facts; at least the following categories exist:

Assertives: Statements that tell others (truly or falsely) how things are.
Directives: Statements telling people to do things.
Commissives: Statements where we agree to commit ourselves to do things.
Declarations: Statements where we bring about changes in our world due to an utterance.
Expressives: Statements of personal feelings and attitudes.

The LCC approach to language allows it to extend into all of the above. This is important because although much of language can be translated into logical formats, many of the qualitative features are not so reducible. In addition, sometimes new concepts emerge that do not have an exact definition agreed to by all parties. Such clusters of related words can be identified by LCC's tools, e.g. by the summarization suite not mentioned in the presentation.

8.0 CYCORP and IKRIS: Knowledge Management is HERE!

Cycorp is the company that is the successor to the CYC project that was started by Dr. Doug Lenat in the mid-1980s at the Microelectronics and Computer Consortium (MCC). To quote their website: "Cycorp was founded in 1994 to research, develop, and commercialize Artificial Intelligence. Cycorp's vision is to create the world's first true artificial intelligence, having both common sense and the ability to reason with it." It currently has an Ontology (in the general sense) with 15,000 predicates, 300,000 concepts and 3,200,000 assertions about them. It is the only such Ontology in existence. It can be extended, and there are many collaborators who are actively working on such extensions.

Cycorp was a heavy contributor to the IKRIS project. This is important because the company had 20 years of experience in specifying and managing knowledge about the real world. That experience was of great value in providing a testable foundation for many ideas. The company also participated in the AQUAINT program and performed other ARDA work as well.

Over the years the CYC Ontology was very well regarded but not well understood by many. Initially the project used LISP as its programming language, and this seemed to make many of its features inaccessible for other applications. Although this is no longer an issue, there was still a lingering doubt as to whether the formalism peculiar to CYC was usable outside of the CYC system itself. IKRIS showed that this is not the case: CYC's formalisms are interoperable. This is an extremely valuable result as one seeks to move forward with the DRM 3.0, the Web 3.0 and SOAs.

The reason that CYC was brought to the SICoP meeting is that any knowledge that is needed for any advanced application can be represented now in CYC and translated to any other equally powerful formalism at any time using IKL (OWL is not a First Order Logic, so it is less expressive). This means that the processes associated with verbs can be specified now - without waiting. It means that SOAs can be specified without having to wait on the (now interoperable) SOA-S specification. It means that the role of data elements in a SOA can be specified with respect to where they are in a large Ontology and what states or sample points they represent in the processes that interact with them. It means that we can start building the Core Knowledge Base for the Federal government today.

CYC pioneered the idea of a Micro-theory (MT), which is realized in IKL in the CONTEXT clause. This is extremely important as it allows the representation of statements in non-monotonic logic. Monotonic logics are the ones used in mathematics, where Pythagoras' Theorem and other such assertions are proved. Non-monotonic logic allows one to capture the results of induction, such as Newton's Law of Gravity. The difference is that Pythagoras' Theorem is always true, but Newton's Law of Gravity must be amended when velocities close to the speed of light are considered.

The critical formalism that bridges human language and the theories of science is the Contrafactual Conditional. These are used to state the laws of science. For example, if one says that "glass is brittle" one asserts that "were a glass to be struck by a hammer then it would shatter". Induction, however, is only as good as the last test case. The assertion by Northern Hemisphere people that "all swans are white" was shown to be false once contact was made with Australia. In that vein, one asks "is that particular glass brittle? It is intact and not shattered!" The answer is not a law but rather an assertion about the contrafactual case - the glass being hit by a hammer. The conditional is the "IF" statement. In IKRIS this is accomplished by a CONTEXT and a Process specification (hitting the glass). The preceding might seem to some to be an issue of splitting hairs, but some hairs have to be split. The inability of the world's greatest logicians in the 1930s to successfully unify language and logic was due in part to the need for a single formalism that would include the contrafactual conditional. Such a formalism is now present and agreed to by all, and with it the door is opened to the extensive use of automated reasoning processes to aid in data sharing. That does not mean that it will be easy, but it is now worth the trouble to undertake.
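A minimal sketch of the idea, written in Python rather than IKL or CycL syntax (the class, fact strings and rule structure below are illustrative only and are not part of any IKRIS or CYC vocabulary):

class MicroTheory:
    """A named bundle of facts and one-step IF/THEN rules; child contexts inherit from a parent."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.facts = set()
        self.rules = []               # (antecedent, consequent) pairs

    def all_facts(self):
        return self.facts | (self.parent.all_facts() if self.parent else set())

    def all_rules(self):
        return self.rules + (self.parent.all_rules() if self.parent else [])

    def holds(self, fact):
        if fact in self.all_facts():
            return True
        return any(c == fact and self.holds(a) for a, c in self.all_rules())

# Base theory: what we assert about the actual world. "Glass is brittle" is recorded as the
# conditional "were glass-1 struck, it would shatter".
world = MicroTheory("ActualWorld")
world.facts.add("glass-1 is intact")
world.rules.append(("glass-1 is struck by a hammer", "glass-1 shatters"))

# Contrafactual context: the hypothetical "the glass is struck" is asserted only here.
struck = MicroTheory("GlassStruckContext", parent=world)
struck.facts.add("glass-1 is struck by a hammer")

print(world.holds("glass-1 shatters"))    # False: nothing in the base theory strikes the glass
print(struck.holds("glass-1 shatters"))   # True: the consequent holds only inside the context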

Another aspect of IKRIS's power is that it has constructs that act as a universal solvent for the problems of Schema mismatches. Database and metadata assertions concerning Entities, Attributes and Values can be restated in IKL, and proofs about transformations among their different combinations can be stated. This is a breakthrough for a field that thought it was dead in 1991!
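A minimal sketch of the mismatch and its restatement, using hypothetical schemas (none of the tables or column names below come from the DRM or the IKRIS documents): the same fact lands in three structurally different places, and a mediator restates each as a neutral (entity, attribute, value) triple that a logic such as IKL could then reason over.

# Source 1: "blood_type" is an ordinary column (Attribute).
db1_row = {"patient_id": 17, "blood_type": "A"}

# Source 2: the blood types themselves are column names (the Attribute encodes the Value).
db2_row = {"patient_id": 17, "type_A": True, "type_B": False}

# Source 3: a generic Entity-Attribute-Value table (the Attribute name is itself a Data Value).
db3_rows = [("patient", 17, "blood_type", "A")]

def from_db1(row):
    return [("patient:%d" % row["patient_id"], "blood_type", row["blood_type"])]

def from_db2(row):
    return [("patient:%d" % row["patient_id"], "blood_type", col.split("_")[1])
            for col, flag in row.items() if col.startswith("type_") and flag]

def from_db3(rows):
    return [("%s:%d" % (ent, key), attr, val) for ent, key, attr, val in rows]

# All three sources now yield the same assertion and can be merged or proved equivalent.
print(from_db1(db1_row) == from_db2(db2_row) == from_db3(db3_rows))   # True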

The CYC presentation is at the same Web site for the February 6th SICoP meeting as was referenced previously. It should be studied carefully.


9.0 Conclusion

IKRIS was announced on April 19th 2006, although not publicly until a couple of weeks later. The problems that heretofore inhibited Data Sharing efforts have been removed in theory, and a panoply of advanced tools is now available to begin removing the barriers in practice. The following three promises to be realized in Web 3.0 are among those stated on the current "Web 3.0" page of Wikipedia (http://en.wikipedia.org/wiki/Web_3.0):
- Transforming the Web into a database
- An evolutionary path to artificial intelligence
- The realization of the Semantic Web and SOA

The answer is "yes - these promises can be realized" thanks to IKRIS and the adjunct linguistic technologies from the projects sponsored by ARDA/DTO.

We can start today!


Addendum to SICoP White Paper 3:

Date: Thursday, July 12, 2007 08:38PM Subject: IKRIS

Lucian, I really enjoyed your presentation at the Semantic Technologies Conference in June and appreciated the opportunity to hear more of your perspective during one of the evening social events. I found your formal presentation exceptionally well organized and lucid. Your description of the advances by DTO in the AQUAINT and IKRIS projects was exciting in the prospects it identified for applying their technologies to promote semantic interoperability.

Your discussion of IKRIS was most evocative of effective interoperability amongst systems using different semantic languages. But, when I went to investigate translators for IKRIS, I was disappointed at being unable to find any. I contacted LCC, only to discover that they did not develop any translators (per the appended response). And, I asked Eric Rickard and Mike Blair of DTO about this at the last SWIG meeting and they indicated that DTO did not follow through on the IKRIS project to make translators available. Thus, it seems that support tools for IKRIS are not quite ready to enable the sort of interoperability that you discussed. Please let me know if you know of any other tools or activities that will help make IKRIS-based translations real.

Also, I got the impression from Brand Niemann that there was a recommendation by the SICoP to the Best Practices committee to consider Cyc and the LLC tools as a basis for semantic interoperability across the Government. Do you know what the basis was for this recommendation - was there a vote at the SICoP meeting on this or on the SICoP Wiki?

The Intelligence Community and DoD are currently addressing issues of semantics in their information sharing standards. So, any light you can shed on your proposals on IKRIS and Cyc might be helpful.

Brian A. Haugh, Ph.D.
Information Technology and Systems Division
Institute for Defense Analyses

Date: Thursday, June 21, 2007 12:37 PM Subject: [Fwd: Request for information on products]

Brian:
Andy forwarded your e-mail on IKRIS and IKL. We were part of the NIMD program and are currently under the CASE program. We don't have any translation tools to go between IKL and the other languages you mentioned.

Munirathnam Srikanth
Language Computer Corporation


Date: Jun 19, 2007 7:23 PM Subject: RE: Materials from LCC (1 of 3)

Andy,
Thanks much for all the documentation. I look forward to reviewing it.

I'd also be interested to know who I might contact in your organization about translation capabilities relative to IKRIS and IKL. More specifically, I'd like to know if you have any translation tools to go between IKL, OWL, CycL, PSL, CL, or SOA-S, and if so which translations are supported in which directions and at what stage of development.

Brian A. Haugh, Ph.D.
Information Technology and Systems Division
Institute for Defense Analyses

Date: Friday, July 13, 2007 08:09AM Subject: RE: IKRIS

Brian, Thanks for your comments. Lucian has the following new e-mail address: [email protected]

Lucian, Please see Brian's comments below.

Hi Chris,
You indicated a Point of Contact at MITRE who can make the IKRIS translation tools available? Please forward his contact information and we can dialog with him.

Brand K. Niemann
SAIC | Advanced Systems & Concepts

Date: Friday, July 13, 2007 08:15AM Subject: RE: IKRIS

Brand Jr. and Brand Sr.,
Thank you for being on top of this. I just sent the following to Brian with a cc to Doug Lenat:

Brian,
Thank you for your e-mail. As I learned at Harvard, always check the original sources.

The IKRIS documentation is currently being reviewed by MITRE prior to release. However, it is not FOUO so the developers can talk about it.

There are two issues:


(1) Do you need the logic in the IKRIS model for some project right now?
(2) Do you need to translate among logic models?

You can do the first but maybe not yet the second.

I couldn't say it explicitly at the Conference, but the IKRIS work validated the CYC model. What that means is that CYC has all of the representational power needed to implement an IKRIS-level system. What IKRIS did was validate this fact by showing that CYC's constructs were interoperable with all the other logical constructs out there. Prior to this result people thought (sorry Doug) that CYC was its own stand-alone thing; it is not, at least with respect to its logic model (it does, however, have its own suite of theorem provers). This was not at all obvious. However, now that it is known, one can build any model one likes using CYC.

CYC also participated in the AQUAINT project, and so it too has methods of ingesting text documents and drawing conclusions. I kept LCC in the picture for two reasons: (1) LCC will continue in language, as it is their business, and (2) there are non-logical issues in human communication that I feel need to be explored, and non-logic is not CYC's business.

I was all set to do a big push in the Semantic Interoperability area this spring, but when I got back from vacation in a remote spot on March 2nd I found out that my brother had died suddenly, leaving an 18-year-old son in college. Since then my energies have been mostly focused on helping stabilize the situation.

However, things are more stable and I have agreed to work with Selmer Bringsjord, part of the IKRIS team, on the Semantic Interoperability of Relational/Object databases; he has a student who has made some progress. I will start this when I return from ACM SIGIR (plus one week vacation) in early August.

Back to the issues:
(1) To do IKRIS work now, use CYC! I am sure that Doug Lenat (cc'd) will work with you to bring it in to NDU or IDA.
(2) To get a translator to another formalism, check with Chris Welty at IBM or the KSL at Stanford - they may have what you need.

Lucian Russell


Date: Friday, July 13, 2007 09:28AM Subject: RE: IKRIS

Hi all, here's an update on IKRIS things.

So far only the IKL spec, which is probably the most important document, has been cleared to Mitre's satisfaction to be put up on the IKRIS public website [http://nrrc.mitre.org/NRRC/ikris.htm]. However, I can send the rest to anyone who asks.

We developed and tested translators in software for three languages in IKRIS: CycL, KIF, and Slate (the language RPI uses). We demonstrated the efficacy of translating to/from OWL, RDF, and LCC's logic language to IKL (and thus through that to any other) but there was never any actual software.

As an IBM employee I can't really be the one to "give" software to anyone w/o some iron-clad agreement that protects IBM from a million possible lawsuits, but the software is theoretically available to distribute to anyone who wants it...

The actual software tooling effort for IKRIS still lives and has been taken over by Battelle, in a project under the DTO ARIVA program run by Dave Thurman (cc'd). So, in fact, DTO did take up the ball, at least a little, to pursue applying IKRIS. I'll let Dave address whether that software can be distributed.

The person at DTO who is most knowledgeable about the state of IKRIS is Brant Cheikes (Brant Cheikes <[email protected]>). If you're in a govt agency, you should be able to get some proactive response from him.

I don't understand the statement (assuming LCC for LLC):

"a recommendation by the SICoP to the Best Practices committee to consider Cycand the LLC tools as a basis for semantic interoperability across the Government"

The part I don't understand is LCC tools as a basis for semantic interoperability, as that is not what any LCC tools are actually for. Was there such a recommendation?

Dr. Christopher A. Welty
IBM Watson Research Center

Date: Friday, July 13, 2007 11:19AM Subject: RE: IKRIS

Chris,

Thanks for the wealth of information in your response. Please do send me any other IKRIS documentation that you can share besides the spec, which I have downloaded.


I'm glad to hear of work on IKRIS translators. I will follow up with Brant Cheikes at DTO to see what more he can tell me.

The issue on best practices comes out of the briefing that Brand Niemann prepared for the last IC Data Management Committee meeting, where he points out semantic deficiencies in the current approach of the IC-DoD Universal Core and states:

"The Intelligence Community has supported efforts (AQUAINT Program) that have provided better tools for a concept authority (WordNet), an upper ontology (OpenCYC), and extraction of ontologies from unstructured text (LLC Corp) that were featured at the February 6th SICoP Special Conference . . ."

Brand's comments on this at the DMC meeting seemed to say that a recommendation had been made to the Best Practices Committee, though I could be mistaken. I believe that "LLC Corp" was a typo, as you suggest, and was actually intended to refer to the Language Computer Corporation (LCC), which worked the information extraction on the AQUAINT program. It is my understanding that such information extraction capabilities are seen as a key aspect of effective semantic interoperability for unstructured information sources (e.g., documents, cables).

Brian A. Haugh
Information Technology and Systems Division

Date: Friday, July 13, 2007 11:54AM Subject: RE: IKRIS and the recommendation

Chris,

Sorry Chris - you were out of this loop, so here's the background.

The heading of this topic is "Semantic Interoperability (and How to Get It)". The government has stated that they need it and would like more of it. In Data Reference Model version 2.0, http://www.whitehouse.gov/omb/egov/documents/DRM_2_0_Final.pdf we see the following text.

"Semantic Interoperability5: Implementing information sharing infrastructures between discrete content owners (even with using service-oriented architectures or business process modeling approaches) still has to contend with problems with different contexts and their associated meanings. Semantic interoperability is a capability that enablesenhanced automated discovery and usage of data due to the enhanced meaning (semantics) that are provided for data."

Where footnote 5 is:
"[5] From Adaptive Information, by Jeffery T. Pollock and Ralph Hodgson, John Wiley and Sons, Inc., ISBN 0-471-48854-2, 2004, p. 6."


So what does "enhanced automated discovery and usage of data" mean? It is kind of "fuzzy", but as the person who was responsible for editing and the final rewrite of the above-cited Chapter and Section's text, I am probably the best person to answer that question (I wrote it in November 2005, pre-IKRIS).

First, a context. In the DRM the other key theory person was Bryan Aucoin, who wrote the first few parts of Chapter 5. Bryan was and is a big advocate of a Services approach. The idea we agreed on was that Data Context and Data Description artifacts should be used to enable Services. That would include those that provide Semantic Interoperability (hence the phrase "even with using service-oriented architectures or business process modeling approaches").

Following this idea up, one can suppose that "enhanced" services would be ones that would understand the meanings of requests for data and be able to mediate different ways in which to satisfy the request. This de facto breaks into two sets of technical issues: those related to Natural Language (NL) and those related to Fixed Field Databases, i.e. ones with schemas (e.g. relational, network, hierarchical, XML, object).

Now let us look at the first problem, queries of Data Resources that are represented using NL; the request that is made to a Service is a query. In the world of text this brings us into the world of Information Retrieval (I will be at ACM SIGIR next week in Amsterdam). However, IR essentially allows you to describe your query as a guess at what your answer would look like; the system then does its best to find the most probable matches. What DTO/ARDA did in the AQUAINT program was re-position the Service back in the query formulation process to emphasize the question. The "enhanced" service would then use linguistic and real-world reference data to relate the question to its known text collections. There are a host of challenges to be faced here, but progress has been made. Language Computer Corporation (LCC) has linguistically oriented COTS products in this area and is moving forward quickly, which is why I mentioned them.

So, one can reasonably project the idea that "enhanced services" would make use of linguistic and real-world reference data artifacts. These obviously include Ontologies in the current OWL-DL sense. However, they require more powerful artifacts as well, those that are best expressed with an IKRIS level formalism. To see this we look at nouns and verbs.

A key issue is the entailment of related concepts. Much has been made of Ontologies recently, notably those specified with OWL-DL. As you stated in our meeting in May 2006, OWL-DL is FOL without variables. An OWL-DL Ontology is primarily described by Classes. This makes it very well suited to describing the meaning of nouns. Intersecting class taxonomies with functions to describe constraints works well for nouns. More important, this approach to modeling allows Ontologies to be consistent with the Lexical Ontology of WordNet.

For Semantic Interoperability, however, we need to deal with verbs: words describing state changes and processes which are sampled. This takes a bit of explanation, provided below (which you can skip if you are familiar with the issues).


Verbs must model time. In OWL-DL, if we have TERM-1 naming a class then STERM-1, STERM-2 .. STERM-n can name specializations of the concept within the class of TERM-1. However, there is only one type of subclass. For verbs the answer is different. As Prof. Fellbaum of WordNet points out, there is not a single class-subclass construct that is usable for verbs; rather there are four. At issue is that there are four different temporal relationships possible between a general verb concept and its more specialized concepts. Also, considering verbs as lexemes - individual words - one finds that they do not nest as deeply as nouns. As she states in the WordNet book, nouns can have 12-14 levels of more specialized meanings whereas verbs typically have at most 4 levels of specialized meanings.
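A minimal sketch of this depth difference, assuming Python, NLTK's WordNet interface and illustrative synsets:

from nltk.corpus import wordnet as wn   # requires nltk and its WordNet corpus

def deepest_path(synset):
    """Length of the longest hypernym path above this synset."""
    return max(len(path) for path in synset.hypernym_paths())

noun = wn.synset("beagle.n.01")     # a fairly specific noun concept
verb = wn.synset("whisper.v.01")    # a fairly specific verb concept (a troponym of talk/speak)

print("noun depth:", deepest_path(noun))   # typically on the order of a dozen levels
print("verb depth:", deepest_path(verb))   # typically only a few levels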

My work with the IC has shown me that the real interest in Question Answering as it relates to verbs is in processes, e.g. is gathering chemicals a case of terrorism or of commerce? (e.g. http://gothamist.com/2007/06/29/tons_of_chemica.php)

When analyzed further, what we see is that OWL-DL Ontologies have connections among concepts named as classes that are "IS-A" or hypernym/hyponym relationships, whereas what is of interest in this case is meronym/holonym relationships. That is because processes have "PART-OF" sub-processes. These relationships, however, are not linguistic in nature. They are temporal models of the real world. They are best captured in Knowledge Representations of the Real World (and capturing these and storing them happens to be the business of CYCORP).

So to get back to Question Answering: understanding a question means understanding the words, both nouns and verbs, and expanding the word list beyond the initial ones used in the query. With nouns, Ontologies work fine. With verbs we must understand more, e.g. (1) ratifying a treaty means that previously there was a signing of a treaty; (2) arriving at a destination pre-supposes traveling to it. These connections are really linguistic issues. Any "enhanced service" must have the ability to expand out the nouns and the verbs and then reason about whether a particular document would be retrieved in response to the query. LCC's software is good at this inference.

Now to fixed field databases. Interoperability has been considered dead since 1991, when it was found that for text fields one database's Entity could be another one's Attribute, which in turn could be another one's Data Value. This means that an "enhanced" service must contend with three totally different data models. In the wild and wonderful wacky world of SQL a Query is well defined, and under the Closed World Assumption Raymond Reiter showed that such a query is a First Order Predicate Calculus proof. If the Entity and Attribute names in the query are different, much less there being a relationship to Data Values, then the SQL of one query must be mapped to several equivalent forms and all of them executed against the appropriate databases. We thus have many hurdles to overcome for any "enhanced" service: we must reason about alternative names and the real meaning of the data in the databases.
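A minimal sketch of that mapping burden, with hypothetical schemas and SQL (nothing below comes from an actual system): the same request must be rewritten differently for a source that stores the attribute as a column and for a source that stores attribute names as data values.

request = {"entity": "facility", "attribute": "pollutant", "value": "benzene"}

def to_sql_columnar(req):
    # Source A stores the attribute as an ordinary column.
    return ("SELECT facility_id FROM facility_emissions "
            "WHERE pollutant = '%s'" % req["value"])

def to_sql_eav(req):
    # Source B stores attribute names as data values in a generic triple table.
    return ("SELECT entity_id FROM observations "
            "WHERE entity_type = '%s' AND attribute_name = '%s' AND attribute_value = '%s'"
            % (req["entity"], req["attribute"], req["value"]))

for rewrite in (to_sql_columnar, to_sql_eav):
    print(rewrite(request))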

So, how do we do all this? Specifically, how do we create reference models of language and real-world descriptions that can help us define the artifacts needed to create "enhanced" services that find equivalent data?


With respect to text collections, it is possible to take existing documents in the government and add their knowledge to a Knowledge Base (whatever we define that to be). This is what CYC does. However, the government's documents cover a very large number of topics, probably excepting only Theology. So we have to extend our reference collections carefully. WordNet has most of the important non-technical vocabulary, but there are many more technical vocabularies to process. Tools such as those of LCC are working on the challenge of the linguistic extraction processes. CYC knows how to manage multiple domains of specialized concepts.

With respect to the fixed field databases, more work has to be done. One aspect is finding the translations among the different modes of representation and bringing in domain constraint modeling. Here I believe that NL will guide the way. The people who wrote the papers describing the Entity-Attribute and Attribute-Data Value equivalences used the English language to do so. Hence it would be possible to reverse engineer the concepts to create models. However, the models would then need an enhanced explanation of the meaning of the text strings that are the Entity, Attribute and Data Value. Much work was done in the 1980s on Data Dictionaries, which I am re-visiting in concert with a key Database Researcher in this area. It is also possible that new textual descriptions of databases need to be generated. What are their scope, form and content, and their relationships to linguistic and Real-World reference Knowledge Bases? Unknown at present.

What is the form of the "enhanced" services that will use these new artifacts once they are created? These, I believe, will be reasoning engines using the full set of logical constructs developed during IKRIS. Now, with anything in the government, COST is a big issue. Although Semantic Interoperability is not here today, it is cost effective for the government to start building towards it. Specifically, I believe that the creation of metadata and semantically exact data description artifacts is a useful expenditure of funds because tools like those of LCC and CYC can process those descriptions. The key issue is that pre-2006 any artifacts describing a database could only be read by people, and agencies may not have had funds to keep these up to date. The money spent for really good documentation in that time period was not well spent because people would not read the documentation. If the documentation can be read by Services then it is worth the investment. That is because tools can also reveal where the documentation must be enhanced. Tools can then inter-relate the concepts in documents.

Although LCC and CYC have toolsets, I know that IBM is also working on toolsets, and if available they too might be deployed right now. If so, then let SICoP via Brand present them at an upcoming SICoP meeting. My impression was that the good stuff was still in the Lab at Watson. If I am wrong I am very pleased. I spend $120/year for DB2 maintenance so I will probably get them all with the next version release (IBM is very generous with software for DB2 developers).

So why do I come up with LCC and CYC? Many of the documents in the government cover issues that are not logical assertions, and I believe that LCC has a flexibility in dealing with language that makes it a good fit for many types of government documents. CYC has been collecting real-world models for decades and hence has mechanisms for creating the knowledge bases of real-world processes. It also can model the advanced features developed in the IKRIS model, so while we wait for IKL tools we can build models in CYC. Notably, CYC is good for domains that are exact, like those where scientific terms and assertions are the norm. Many agencies have massive scientific collections of text documents and related databases. If you feel otherwise please let me know how and why - I was neither the IKRIS PI nor on the team, so I may well have overlooked or underestimated something (or just not understood the issue!)

Who will develop these? Well, I expect that I, among others, will work on the artifacts and that YOU, in some manner, will work on the services (I bet WebSphere would be VERY interested in paying for it)!

Lucian Russell

Date: Friday, July 13, 2007 12:24PM Subject: RE: IKRIS

Brian,
Brand, Sr. was the one with the recommendation.

Brand K. Niemann
SAIC | Advanced Systems & Concepts

Date: Friday, July 13, 2007 11:51 AM Subject: RE: IKRIS

Brand, Thanks for your input on the email. I did find Lucian's working address last night after my original message bounced.

But, please let me know the status of the cited recommendation to the Best Practices Committee. Was such a recommendation made for Cyc and LCC's tools, or was I imagining that :-)?

Brian A. Haugh
Information Technology and Systems Division

Date: Friday, July 13, 2007 12:56PM Subject: RE: IKRIS

Brian,
With regard to the recommendations to the DNI CIO based on SICoP and the BP Committee that my Dad made: the recommendations were given at the IC DMC meeting back in February or March and posted as a SICoP e-mail on March 9. The recommendations were made in the meeting to the presenter, Jim Feagans, and to Steve Selwyn. Steve Selwyn's role has subsequently changed.

http://colab.cim3.net/file/work/SICoP/2007-03-07/SICoPICDMC03072007.ppt


Brand K. Niemann
SAIC | Advanced Systems & Concepts

Date: Friday, July 13, 2007 12:43PM Subject: RE: IKRIS

Thanks to all for a great information-sharing exchange on Brian's original question. The SICoP White Paper 3 was delivered and defended in multiple venues (see slide 7 in http://colab.cim3.net/file/work/SICoP/2007-07-11/SICoP07112007.ppt), culminating in the CIOC Best Practices Committee on June 18th, and we are now helping the NCOIC SIF, EPA, and others implement it to gain more feedback and experience (see http://colab.cim3.net/file/work/SICoP/2007-07-05/SICoPNCOICSIF.ppt). I will be posting an update to the latter based on the July 12th NCOIC SIF meeting soon.

Brand Niemann, Sr.
SICoP Co-chair
