Page 1
Semantic Interoperability – Yes!
Presentation to the CIO Council, June 18th 2007
Lucian Russell, Ph.D.
Page 2
Semantic Interoperability
• What is it?
• The Data Reference Model Version 2.0 states:
– 3.2. Introduction
– 3.2.1. What is Data Description and Why is it Important …
– Semantic Interoperability: Implementing information sharing infrastructures between discrete content owners (even with using service-oriented architectures or business process modeling approaches) still has to contend with problems with different contexts and their associated meanings. Semantic interoperability is a capability that enables enhanced automated discovery and usage of data due to the enhanced meaning (semantics) that are provided for data.
• Semantic Interoperability is a condition that is created with respect to a Data Resource that is under the control of an Agency.
– Associated with each Data Resource is another resource that allows a “reasoning service” to identify its “semantics” and determine its value with respect to a query.
– Left on the table: how are “reasoning services” created, and what additional data resources are needed?
Page 3
In 2005 there was no direct answer, only a template
• See DRM Version 2.0 Chapter 2, Figure 2.5
– Digital Data Resources can be Structured, Semi-Structured or Unstructured, and can be contained within a document. These can describe a Data Asset.
– On the other hand, a Data Asset can provide a management context for a Digital Data Resource.
– Topics in a language can categorize either (i.e. they are instances of a class designated by the topic word).
• To support “enhanced automated discovery”, though, we need to use some combination of instances of these three entities.
• Interoperability would then depend on the adequacy of the combination.
• In 2005 the way was unclear, but there was a template.
• On Page 18 “…Data Description artifacts are an output of the process of providing data syntax and semantics and a meaningful identification for a data resource so as to make it visible and usable by a COI.”
• The most effective government COI is the Global Change Master Directory
• The GCMD indexes 18 petabytes of multi-agency data: it was the template
Page 4
In 2006 there were several breakthroughs
• The unclassified R&D sponsored by the Intelligence Community produced several important breakthroughs that impact “enhanced discovery services”
– AQUAINT – Advanced Question Answering for Intelligence
WordNet was enhanced to create a disambiguated description of the most common words in the English language, some 115,000 words and their meanings.
A markup language for time, TimeML.
An extraction technique to parse English-language text and create logical relations.
– NIMD – Novel Intelligence from Massive Data (FOUO)
Released (not-FOUO) slides announced a breakthrough from the IKRIS project – Interoperable Knowledge Representation for Intelligence Support.
• The IKRIS Project’s Challenge:
– “How to enable interoperability of knowledge representation and reasoning (KR&R) technology developed by multiple organizations in multiple DTO programs and designed to perform different tasks”
• The Results:
– A new language, IKL, that translates among knowledge representation languages
– An extension of logic to 2nd-order and non-monotonic expressions
– A proof of equivalence among process specifications
Page 5
These results open the way to SI using - English!
• The implications of the results are staggering – English descriptions in documents can be used to enable enhanced automated discovery.
• There were limitations on concepts that could be represented:
– Prior “semantic” technology (e.g. OWL-DL) only allowed for precise descriptions of concepts represented by nouns, i.e. taxonomies. “Ontologies” were defined as overlapping taxonomies.
– WordNet now allows nouns to be unambiguously described.
– WordNet has clearly demonstrated that nouns have single-subtype taxonomies but verbs do not: because there is a time element in all verbs’ meanings, they have four sub-classes (verbs describe 4-D motions or state changes).
– Consequently, nouns and verbs cannot be intermixed meaningfully (without inconsistency) in OWL-DL ontologies.
– Representing concepts using verbs entails describing processes, which are multiple verbs in “part-of” (meronymic/holonymic) relationships.
– English descriptions of processes were imprecise because relative time concepts were heretofore too poorly understood to support automation.
• With WordNet and TimeML we can now precisely describe the processes that create and change data, as well as the nouns used for the real world.
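The noun/verb contrast above can be sketched with toy data (hand-built for illustration – this is not real WordNet content or its API):

```python
# Toy sketch of the structural difference: noun senses chain up a single
# IS-A (hypernym) taxonomy, while verb senses also carry a temporal,
# state-change component that a plain tree cannot capture.

# Noun senses: one hypernym each -> a single-subtype taxonomy.
noun_hypernym = {
    "freight train": "train",
    "passenger train": "train",
    "train": "conveyance",
    "conveyance": "entity",
}

def isa_chain(word):
    """Walk the single IS-A chain from a noun sense up to the root."""
    chain = [word]
    while word in noun_hypernym:
        word = noun_hypernym[word]
        chain.append(word)
    return chain

# Verb senses: meaning includes a 4-D element, so each entry records a
# motion/state-change profile and entailments rather than one hypernym.
verb_sense = {
    "board":  {"kind": "motion",       "entails": ["be on"]},
    "arrive": {"kind": "state change", "entails": ["be at"]},
}

print(isa_chain("freight train"))
# -> ['freight train', 'train', 'conveyance', 'entity']
```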
Page 6
TimeML
• Markup Language for Temporal and Event Expressions
• TimeML is a robust specification language for events and temporal expressions in natural language. It is designed to address four problems in event and temporal expression markup:
– (1) Time stamping of events (identifying an event and anchoring it in time);
– (2) Ordering events with respect to one another (lexical versus discourse properties of ordering);
– (3) Reasoning with contextually underspecified temporal expressions (temporal functions such as 'last week' and 'two weeks before');
– (4) Reasoning about the persistence of events (how long does an event or the outcome of an event last).
• The rules that identify temporal dependencies can be used to insert tags into text; the tagged text can then be processed automatically.
• Processes that entail other sub-processes can also be processed logically, i.e. we can infer from “A filed an application” the fact that “A filled out an application”.
• Language Computer Corporation (AQUAINT) finds logical relations in text
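A simplified TimeML-style fragment illustrates the idea (values invented; real TimeML annotation also uses MAKEINSTANCE elements and richer attributes): an event is tagged, the underspecified expression 'last week' is resolved to a value, and a TLINK anchors the event in time.

```xml
<s>
  A <EVENT eid="e1" class="OCCURRENCE">filed</EVENT> an application
  <TIMEX3 tid="t1" type="DATE" value="2007-06-11">last week</TIMEX3>.
</s>
<TLINK lid="l1" eventID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
```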
Page 7
LCC Product Polaris Semantic Relations
# Semantic Relation Abbr
1 POSSESSION POS
2 KINSHIP KIN
3 PROPERTY-ATTRIBUTE HOLDER PAH
4 AGENT AGT
5 TEMPORAL TMP
6 DEPICTION DPC
7 PART-WHOLE PW
8 HYPONYMY ISA
9 ENTAIL ENT
10 CAUSE CAU
11 MAKE-PRODUCE MAK
12 INSTRUMENT INS
13 LOCATION-SPACE LOC
14 PURPOSE PRP
15 SOURCE-FROM SRC
16 TOPIC TPC
17 MANNER MNR
18 MEANS MNS
19 ACCOMPANIMENT-COMPANION ACC
20 EXPERIENCER EXP
21 RECIPIENT REC
22 FREQUENCY FRQ
23 INFLUENCE IFL
24 ASSOCIATED-WITH / OTHER OTH
25 MEASURE MEA
26 SYNONYMY-NAME SYN
27 ANTONYMY ANT
28 PROBABILITY-OF-EXISTENCE PRB
29 POSSIBILITY PSB
30 CERTAINTY CRT
31 THEME-PATIENT THM
32 RESULT RSL
33 STIMULUS STI
34 EXTENT EXT
35 PREDICATE PRD
36 BELIEF BLF
37 GOAL GOL
38 MEANING MNG
39 JUSTIFICATION JST
40 EXPLANATION EXN
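The inventory above, transcribed as a lookup table (the relation names and abbreviations are from the slide; the dict structure is merely illustrative):

```python
# Polaris semantic relations, keyed by abbreviation.
POLARIS_RELATIONS = {
    "POS": "POSSESSION",               "REC": "RECIPIENT",
    "KIN": "KINSHIP",                  "FRQ": "FREQUENCY",
    "PAH": "PROPERTY-ATTRIBUTE HOLDER","IFL": "INFLUENCE",
    "AGT": "AGENT",                    "OTH": "ASSOCIATED-WITH / OTHER",
    "TMP": "TEMPORAL",                 "MEA": "MEASURE",
    "DPC": "DEPICTION",                "SYN": "SYNONYMY-NAME",
    "PW":  "PART-WHOLE",               "ANT": "ANTONYMY",
    "ISA": "HYPONYMY",                 "PRB": "PROBABILITY-OF-EXISTENCE",
    "ENT": "ENTAIL",                   "PSB": "POSSIBILITY",
    "CAU": "CAUSE",                    "CRT": "CERTAINTY",
    "MAK": "MAKE-PRODUCE",             "THM": "THEME-PATIENT",
    "INS": "INSTRUMENT",               "RSL": "RESULT",
    "LOC": "LOCATION-SPACE",           "STI": "STIMULUS",
    "PRP": "PURPOSE",                  "EXT": "EXTENT",
    "SRC": "SOURCE-FROM",              "PRD": "PREDICATE",
    "TPC": "TOPIC",                    "BLF": "BELIEF",
    "MNR": "MANNER",                   "GOL": "GOAL",
    "MNS": "MEANS",                    "MNG": "MEANING",
    "ACC": "ACCOMPANIMENT-COMPANION",  "JST": "JUSTIFICATION",
    "EXP": "EXPERIENCER",              "EXN": "EXPLANATION",
}

print(len(POLARIS_RELATIONS))      # -> 40
print(POLARIS_RELATIONS["AGT"])    # -> AGENT
```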
Page 8
• LCC’s Jaguar product can automatically generate ontologies and structured knowledge bases from text
– Ontologies form the framework or “skeleton” of the knowledge base
– A rich set of semantic relations forms the “muscle” that connects concepts in the knowledge base
[Diagram: a small knowledge graph extracted from text about trains – nouns such as ship, train, passenger train and freight train linked by IS-A, with verbs such as carry, conduct, board, transport, arrive, run and stop attached via AGENT, THEME and MEANS relations.]
LCC’s Jaguar: Knowledge Extraction
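A minimal sketch of the “skeleton and muscle” idea – IS-A links form the ontology backbone and semantic relations connect concepts. The triples are invented, loosely based on the figure, and are not LCC Jaguar's actual output:

```python
# Extracted knowledge as (subject, relation, object) triples.
triples = [
    ("passenger train", "IS-A", "train"),
    ("freight train",   "IS-A", "train"),
    ("passenger",       "AGT",  "board"),      # AGENT relation
    ("train",           "MNS",  "transport"),  # MEANS relation
]

def related(concept):
    """All (relation, other-concept) pairs touching a concept, in either role."""
    return ([(r, o) for s, r, o in triples if s == concept] +
            [(r, s) for s, r, o in triples if o == concept])

print(related("train"))
# -> [('MNS', 'transport'), ('IS-A', 'passenger train'), ('IS-A', 'freight train')]
```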
Page 9
It is now Cost Effective to “Document” Databases!
• Previously, documentation of databases was a black hole for budget dollars
– Only people (not machines) could read the documentation
– It was never kept up to date
– Rules within it “evolved” over time
– Hence people never read the documentation anyway, and the data was inconsistent
– ETL techniques, data warehouses and data marts were used to get uniformity, but substituting computer-generated data for stored data is no guarantee of accuracy.
• Now text descriptions of databases can be processed automatically
– The correct WordNet sense of each word can be used. A correct description of the relationships among data attributes, and of the processes that created them, can now be used for semantic processing.
– The text can be extracted and used to create knowledge repositories!
• AQUAINT and NIMD also enhanced the CYC Knowledge Base
– CYCORP has the world’s largest general ontology and knowledge base describing the real world. It can be extended and used for interoperability.
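One way the sense-tagged descriptions could be used, as a sketch (the record layout and sense keys are hypothetical, chosen only to illustrate matching by meaning rather than by column name):

```python
# Two agencies' column documentation, annotated with WordNet-style sense
# keys resolved during automatic processing of the text descriptions.
entry_doc = {
    "column": "entry_date",
    "description": "date the applicant entered the country",
    "senses": {"date": "date.n.01", "enter": "enter.v.01"},  # assumed keys
}
arrival_doc = {
    "column": "arrival_dt",
    "description": "date of arrival in the country",
    "senses": {"date": "date.n.01", "arrive": "arrive.v.01"},
}

def shares_sense(a, b):
    """Columns are candidate matches if any annotated word sense overlaps."""
    return bool(set(a["senses"].values()) & set(b["senses"].values()))

print(shares_sense(entry_doc, arrival_doc))  # -> True (both carry date.n.01)
```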
Page 10
Page 11
How can this be done? Carefully!
[Diagram: three kinds of data – Real World Data (mathematical patterns; data are samples), Social World Data (data are state changes), and Data about Individuals (data are both). Old-fashioned 1970s data modeling destroys these distinctions: gold is lost in a gray mass of sameness.]
Page 12
Look at each type of data and how it comes into being!
Example: A USCIS form has 10 Object types
Image objects: Photograph, Signature, Fingerprints
Data Elements:
1: Name & Country of Citizenship
2: Identification Numbers
3: Residence History
4: Education History
5: Employment History
6: Arrivals & Departures
7: Arrests & Citations
8: Marital Information
9: Children’s Names
10: Parents’ Country of Citizenship
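The form's element groups above, as a plain record (a hypothetical structure) so that each group could later carry its own provenance – how and when that data came into being:

```python
# USCIS form object types from the slide, keyed by group number.
FORM_GROUPS = {
    1: "Name & Country of Citizenship",
    2: "Identification Numbers",
    3: "Residence History",
    4: "Education History",
    5: "Employment History",
    6: "Arrivals & Departures",
    7: "Arrests & Citations",
    8: "Marital Information",
    9: "Children's Names",
    10: "Parents' Country of Citizenship",
}
IMAGE_OBJECTS = ["Photograph", "Signature", "Fingerprints"]

print(FORM_GROUPS[6])  # -> Arrivals & Departures
```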
Page 13
Structured Data and Schema Mismatch
• Syntactic Schema Mismatch:
– IEEE Computer, December 1991, showed that a large number of syntactic mismatches among representations of data were a barrier to data integration or sharing.
• Entities = Attributes = Data Values – Nonsense or Computer Science?
• Computer Science: Semantic Schema Mismatches
– In 1986 it was published in Computing Surveys that, when looking at how to integrate databases, one database’s entity could be another database’s attribute.
– In 1991 a research result showed that an attribute in one database could be a data value in another database.
• So, with a potential for this degree of mismatch, sending XML schemas to a repository is not necessarily a help to semantic interoperability.
• The field of database integration essentially went dead in 1991.
• HOWEVER, another side effect of IKRIS is that it is now possible to detect semantic similarities among databases even when there are different representations of the data as entity, attribute and data value – it won’t be perfect, but it will be a lot better than what we have.
• Additional work is starting on using ANSI Data Dictionary structures and populating them automatically.
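The entity/attribute/value mismatch can be made concrete with a tiny sketch (sqlite3 from the Python standard library; the schemas and the fact are invented for illustration):

```python
# The same fact -- "the Maria is a ship" -- in three relational shapes.
import sqlite3

db = sqlite3.connect(":memory:")

# 1. "Ship" as an ENTITY: a table of its own.
db.execute("CREATE TABLE ship (name TEXT)")
db.execute("INSERT INTO ship VALUES ('Maria')")

# 2. "Ship" as an ATTRIBUTE: a flag column on a wider table.
db.execute("CREATE TABLE vessel (name TEXT, is_ship INTEGER)")
db.execute("INSERT INTO vessel VALUES ('Maria', 1)")

# 3. "Ship" as a DATA VALUE: a row in a generic category column.
db.execute("CREATE TABLE thing (name TEXT, category TEXT)")
db.execute("INSERT INTO thing VALUES ('Maria', 'ship')")

# Schema-level comparison sees three unrelated structures; the fact is
# identical in all three, which is why XML schema exchange alone is not
# enough for semantic interoperability.
print(db.execute("SELECT category FROM thing WHERE name='Maria'").fetchone())
```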
Page 14
In Conclusion
• It is possible to increase Data Sharing in the government
• To enable enhanced automated discovery
– Start with the Global Change Master Directory as a template and expand
– Create new data descriptions
– Use the English language correctly
– Build process descriptions that show how and when data was generated
– Use advanced linguistic tools to extract data relationships
– Integrate with a general knowledge base
• To overcome Schema Mismatch
– Revisit old data models and carefully expand existing definitions to show the full semantics of the data schema
– Keep in mind that in the Real World one collects data samples of continuous processes, whereas the Social World records state changes. Individuals’ data combines both.
• There is no easy solution, but advanced tools ensure that any effort spent today is re-usable tomorrow, so there is no loss of value for investments in improving data descriptions.