Page 1
Semantic Interoperability – Yes!
Presentation to the CIO Council, June 18th 2007
Lucian Russell, Ph.D.
Page 2
Semantic Interoperability
• What is it?
• The Data Reference Model Version 2.0 states:
– 3.2. Introduction
– 3.2.1. What is Data Description and Why is it Important …
– Semantic Interoperability: Implementing information sharing infrastructures between discrete content owners (even with using service-oriented architectures or business process modeling approaches) still has to contend with problems with different contexts and their associated meanings. Semantic interoperability is a capability that enables enhanced automated discovery and usage of data due to the enhanced meaning (semantics) that are provided for data.
• Semantic Interoperability is a condition that is created with respect to a Data Resource that is under the control of an Agency.
– Associated with each Data Resource is another resource that allows a “reasoning service” to identify its “semantics” and determine its value with respect to a query.
– Left on the table: how are “reasoning services” created, and what additional data resources are needed?
Page 3
In 2005 there was no direct answer, only a template
• See DRM Version 2.0 Chapter 2, Figure 2.5
– Digital Data Resources can be Structured, Semi-Structured or Unstructured, and can be contained within a document. These can describe a Data Asset.
– On the other hand, a Data Asset can provide a management context for a Digital Data Resource.
– Topics in a language can categorize either (i.e. they are instances of a class designated by the topic word).
• To support “enhanced automated discovery”, though, we need to use some combination of instances of these three entities.
• Interoperability would then depend on the adequacy of the combination.
• In 2005 the way was unclear, but there was a template.
• On Page 18 “…Data Description artifacts are an output of the process of providing data syntax and semantics and a meaningful identification for a data resource so as to make it visible and usable by a COI.”
• The most effective government COI is the Global Change Master Directory
• The GCMD indexes 18 petabytes of multi-agency data: it was the template
Page 4
In 2006 there were several breakthroughs
• The unclassified R&D sponsored by the Intelligence Community produced several important breakthroughs that impact “enhanced discovery services”
– AQUAINT – Advanced Question Answering for Intelligence
WordNet was enhanced to create a disambiguated description of the most common words in the English language, some 115,000 words and their meanings.
A markup language for time, TimeML.
An extraction technique to parse English-language text and create logical relations.
– NIMD – Novel Intelligence from Massive Data (FOUO)
Released (not-FOUO) slides announced a breakthrough from the IKRIS project – Interoperable Knowledge Representation for Intelligence Support.
• The IKRIS Project’s Challenge:
– “How to enable interoperability of knowledge representation and reasoning (KR&R) technology developed by multiple organizations in multiple DTO programs and designed to perform different tasks”
• The Results:
– A new language, IKL, that translates among knowledge representation languages
– An extension of logic to 2nd-order and non-monotonic expressions
– A proof of equivalence among process specifications
Page 5
These results open the way to SI using - English!
• The implications of the results are staggering – English descriptions in documents can be used to enable enhanced automated discovery.
• There were limitations on concepts that could be represented:
– Prior “semantic” technology (e.g. OWL-DL) only allowed for precise descriptions of concepts represented by nouns, i.e. taxonomies. “Ontologies” were defined as overlapping taxonomies.
– WordNet now allows nouns to be unambiguously described.
– WordNet has clearly demonstrated that nouns have single-subtype taxonomies but verbs do not: because there is a time element in all verbs’ meanings, they have four sub-classes (verbs describe 4-D motions or state changes).
– Consequently, nouns and verbs cannot be intermixed meaningfully (without inconsistency) in OWL-DL ontologies.
– Representing concepts using verbs entails describing processes, which are multiple verbs in “part-of” (meronymic/holonymic) relationships.
– English descriptions of processes were imprecise because relative time concepts were heretofore too poorly understood to support automation.
• With WordNet and TimeML we can now precisely describe the processes that create and change data, as well as the nouns used for the real world.
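The noun/verb contrast above can be sketched with toy data (hand-built for illustration – this is not real WordNet content or its API):

```python
# Toy sketch of the structural difference: noun senses chain up a single
# IS-A (hypernym) taxonomy, while verb senses also carry a temporal,
# state-change component that a plain tree cannot capture.

# Noun senses: one hypernym each -> a single-subtype taxonomy.
noun_hypernym = {
    "freight train": "train",
    "passenger train": "train",
    "train": "conveyance",
    "conveyance": "entity",
}

def isa_chain(word):
    """Walk the single IS-A chain from a noun sense up to the root."""
    chain = [word]
    while word in noun_hypernym:
        word = noun_hypernym[word]
        chain.append(word)
    return chain

# Verb senses: meaning includes a 4-D element, so each entry records a
# motion/state-change profile and entailments rather than one hypernym.
verb_sense = {
    "board":  {"kind": "motion",       "entails": ["be on"]},
    "arrive": {"kind": "state change", "entails": ["be at"]},
}

print(isa_chain("freight train"))
# -> ['freight train', 'train', 'conveyance', 'entity']
```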
Page 6
TimeML
• Markup Language for Temporal and Event Expressions
• TimeML is a robust specification language for events and temporal expressions in natural language. It is designed to address four problems in event and temporal expression markup:
– (1) Time stamping of events (identifying an event and anchoring it in time);
– (2) Ordering events with respect to one another (lexical versus discourse properties of ordering);
– (3) Reasoning with contextually underspecified temporal expressions (temporal functions such as 'last week' and 'two weeks before');
– (4) Reasoning about the persistence of events (how long does an event or the outcome of an event last).
• The rules that identify temporal dependencies can be used to insert tags into text; the tagged text can then be processed automatically.
• Processes that entail other sub-processes can also be processed logically, i.e. we can infer from “A filed an application” the fact that “A filled out an application”.
• Language Computer Corporation (AQUAINT) finds logical relations in text
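A simplified TimeML-style fragment illustrates the idea (values invented; real TimeML annotation also uses MAKEINSTANCE elements and richer attributes): an event is tagged, the underspecified expression 'last week' is resolved to a value, and a TLINK anchors the event in time.

```xml
<s>
  A <EVENT eid="e1" class="OCCURRENCE">filed</EVENT> an application
  <TIMEX3 tid="t1" type="DATE" value="2007-06-11">last week</TIMEX3>.
</s>
<TLINK lid="l1" eventID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
```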
Page 7
LCC Product Polaris Semantic Relations
# Semantic Relation Abbr
1 POSSESSION POS
2 KINSHIP KIN
3 PROPERTY-ATTRIBUTE HOLDER PAH
4 AGENT AGT
5 TEMPORAL TMP
6 DEPICTION DPC
7 PART-WHOLE PW
8 HYPONYMY ISA
9 ENTAIL ENT
10 CAUSE CAU
11 MAKE-PRODUCE MAK
12 INSTRUMENT INS
13 LOCATION-SPACE LOC
14 PURPOSE PRP
15 SOURCE-FROM SRC
16 TOPIC TPC
17 MANNER MNR
18 MEANS MNS
19 ACCOMPANIMENT-COMPANION ACC
20 EXPERIENCER EXP
21 RECIPIENT REC
22 FREQUENCY FRQ
23 INFLUENCE IFL
24 ASSOCIATED-WITH / OTHER OTH
25 MEASURE MEA
26 SYNONYMY-NAME SYN
27 ANTONYMY ANT
28 PROBABILITY-OF-EXISTENCE PRB
29 POSSIBILITY PSB
30 CERTAINTY CRT
31 THEME-PATIENT THM
32 RESULT RSL
33 STIMULUS STI
34 EXTENT EXT
35 PREDICATE PRD
36 BELIEF BLF
37 GOAL GOL
38 MEANING MNG
39 JUSTIFICATION JST
40 EXPLANATION EXN
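The inventory above, transcribed as a lookup table (the relation names and abbreviations are from the slide; the dict structure is merely illustrative):

```python
# Polaris semantic relations, keyed by abbreviation.
POLARIS_RELATIONS = {
    "POS": "POSSESSION",               "REC": "RECIPIENT",
    "KIN": "KINSHIP",                  "FRQ": "FREQUENCY",
    "PAH": "PROPERTY-ATTRIBUTE HOLDER","IFL": "INFLUENCE",
    "AGT": "AGENT",                    "OTH": "ASSOCIATED-WITH / OTHER",
    "TMP": "TEMPORAL",                 "MEA": "MEASURE",
    "DPC": "DEPICTION",                "SYN": "SYNONYMY-NAME",
    "PW":  "PART-WHOLE",               "ANT": "ANTONYMY",
    "ISA": "HYPONYMY",                 "PRB": "PROBABILITY-OF-EXISTENCE",
    "ENT": "ENTAIL",                   "PSB": "POSSIBILITY",
    "CAU": "CAUSE",                    "CRT": "CERTAINTY",
    "MAK": "MAKE-PRODUCE",             "THM": "THEME-PATIENT",
    "INS": "INSTRUMENT",               "RSL": "RESULT",
    "LOC": "LOCATION-SPACE",           "STI": "STIMULUS",
    "PRP": "PURPOSE",                  "EXT": "EXTENT",
    "SRC": "SOURCE-FROM",              "PRD": "PREDICATE",
    "TPC": "TOPIC",                    "BLF": "BELIEF",
    "MNR": "MANNER",                   "GOL": "GOAL",
    "MNS": "MEANS",                    "MNG": "MEANING",
    "ACC": "ACCOMPANIMENT-COMPANION",  "JST": "JUSTIFICATION",
    "EXP": "EXPERIENCER",              "EXN": "EXPLANATION",
}

print(len(POLARIS_RELATIONS))      # -> 40
print(POLARIS_RELATIONS["AGT"])    # -> AGENT
```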
Page 8
• LCC’s Jaguar product can automatically generate ontologies and structured knowledge bases from text
– Ontologies form the framework or “skeleton” of the knowledge base
– A rich set of semantic relations forms the “muscle” that connects concepts in the knowledge base
[Diagram: a small knowledge graph extracted from text about trains – nouns such as ship, train, passenger train and freight train linked by IS-A, with verbs such as carry, conduct, board, transport, arrive, run and stop attached via AGENT, THEME and MEANS relations.]
LCC’s Jaguar: Knowledge Extraction
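A minimal sketch of the “skeleton and muscle” idea – IS-A links form the ontology backbone and semantic relations connect concepts. The triples are invented, loosely based on the figure, and are not LCC Jaguar's actual output:

```python
# Extracted knowledge as (subject, relation, object) triples.
triples = [
    ("passenger train", "IS-A", "train"),
    ("freight train",   "IS-A", "train"),
    ("passenger",       "AGT",  "board"),      # AGENT relation
    ("train",           "MNS",  "transport"),  # MEANS relation
]

def related(concept):
    """All (relation, other-concept) pairs touching a concept, in either role."""
    return ([(r, o) for s, r, o in triples if s == concept] +
            [(r, s) for s, r, o in triples if o == concept])

print(related("train"))
# -> [('MNS', 'transport'), ('IS-A', 'passenger train'), ('IS-A', 'freight train')]
```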
Page 9
It is now Cost Effective to “Document” Databases!
• Previously, documentation of databases was a black hole for budget dollars
– Only people (not machines) could read the documentation
– It was never kept up to date
– Rules within it “evolved” over time
– Hence people never read the documentation anyway, and the data was inconsistent
– ETL techniques, data warehouses and data marts were used to get uniformity, but substituting computer-generated data for stored data is no guarantee of accuracy.
• Now text descriptions of databases can be processed automatically
– The correct WordNet sense of each word can be used. A correct description of the relationships among data attributes, and of the processes that created them, can now be used for semantic processing.
– The text can be extracted and used to create knowledge repositories!
• AQUAINT and NIMD also enhanced the CYC Knowledge Base
– CYCORP has the world’s largest general ontology and knowledge base describing the real world. It can be extended and used for interoperability.
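One way the sense-tagged descriptions could be used, as a sketch (the record layout and sense keys are hypothetical, chosen only to illustrate matching by meaning rather than by column name):

```python
# Two agencies' column documentation, annotated with WordNet-style sense
# keys resolved during automatic processing of the text descriptions.
entry_doc = {
    "column": "entry_date",
    "description": "date the applicant entered the country",
    "senses": {"date": "date.n.01", "enter": "enter.v.01"},  # assumed keys
}
arrival_doc = {
    "column": "arrival_dt",
    "description": "date of arrival in the country",
    "senses": {"date": "date.n.01", "arrive": "arrive.v.01"},
}

def shares_sense(a, b):
    """Columns are candidate matches if any annotated word sense overlaps."""
    return bool(set(a["senses"].values()) & set(b["senses"].values()))

print(shares_sense(entry_doc, arrival_doc))  # -> True (both carry date.n.01)
```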
Page 10
Page 11
How can this be done? Carefully!
[Diagram: three kinds of data – Real World Data (mathematical patterns; data are samples), Social World Data (data are state changes), and Data about Individuals (data are both). Old-fashioned 1970s data modeling destroys these distinctions: gold is lost in a gray mass of sameness.]
Page 12
Look at each type of data and how it comes into being!
Example: A USCIS form has 10 Object types
Image objects: Photograph, Signature, Fingerprints
Data Elements:
1: Name & Country of Citizenship
2: Identification Numbers
3: Residence History
4: Education History
5: Employment History
6: Arrivals & Departures
7: Arrests & Citations
8: Marital Information
9: Children’s Names
10: Parents’ Country of Citizenship
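The form's element groups above, as a plain record (a hypothetical structure) so that each group could later carry its own provenance – how and when that data came into being:

```python
# USCIS form object types from the slide, keyed by group number.
FORM_GROUPS = {
    1: "Name & Country of Citizenship",
    2: "Identification Numbers",
    3: "Residence History",
    4: "Education History",
    5: "Employment History",
    6: "Arrivals & Departures",
    7: "Arrests & Citations",
    8: "Marital Information",
    9: "Children's Names",
    10: "Parents' Country of Citizenship",
}
IMAGE_OBJECTS = ["Photograph", "Signature", "Fingerprints"]

print(FORM_GROUPS[6])  # -> Arrivals & Departures
```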
Page 13
Structured Data and Schema Mismatch
• Syntactic Schema Mismatch:
– IEEE Computer, December 1991, showed that a large number of syntactic mismatches among representations of data were a barrier to data integration or sharing.
• Entities = Attributes = Data Values – Nonsense or Computer Science?
• Computer Science: Semantic Schema Mismatches
– In 1986 it was published in Computing Surveys that, when looking at how to integrate databases, one database’s entity could be another database’s attribute.
– In 1991 a research result showed that an attribute in one database could be a data value in another database.
• So, with a potential for this degree of mismatch, sending XML schemas to a repository is not necessarily a help to semantic interoperability.
• The field of database integration essentially went dead in 1991.
• HOWEVER, another side effect of IKRIS is that it is now possible to detect semantic similarities among databases even when there are different representations of the data as entity, attribute and data value – it won’t be perfect, but it will be a lot better than what we have.
• Additional work is starting on using ANSI Data Dictionary structures and populating them automatically.
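The entity/attribute/value mismatch can be made concrete with a tiny sketch (sqlite3 from the Python standard library; the schemas and the fact are invented for illustration):

```python
# The same fact -- "the Maria is a ship" -- in three relational shapes.
import sqlite3

db = sqlite3.connect(":memory:")

# 1. "Ship" as an ENTITY: a table of its own.
db.execute("CREATE TABLE ship (name TEXT)")
db.execute("INSERT INTO ship VALUES ('Maria')")

# 2. "Ship" as an ATTRIBUTE: a flag column on a wider table.
db.execute("CREATE TABLE vessel (name TEXT, is_ship INTEGER)")
db.execute("INSERT INTO vessel VALUES ('Maria', 1)")

# 3. "Ship" as a DATA VALUE: a row in a generic category column.
db.execute("CREATE TABLE thing (name TEXT, category TEXT)")
db.execute("INSERT INTO thing VALUES ('Maria', 'ship')")

# Schema-level comparison sees three unrelated structures; the fact is
# identical in all three, which is why XML schema exchange alone is not
# enough for semantic interoperability.
print(db.execute("SELECT category FROM thing WHERE name='Maria'").fetchone())
```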
Page 14
In Conclusion
• It is possible to increase Data Sharing in the government
• To enable enhanced automated discovery
– Start with the Global Change Master Directory as a template and expand
– Create new data descriptions
– Use the English language correctly
– Build process descriptions that show how and when data was generated
– Use advanced linguistic tools to extract data relationships
– Integrate with a general knowledge base
• To overcome Schema Mismatch
– Revisit old data models and carefully expand existing definitions to show the full semantics of the data schema
– Keep in mind that in the Real World one collects data samples of continuous processes, whereas the Social World records state changes. Individuals’ data combines both.
• There is no easy solution, but advanced tools ensure that any effort spent today is re-usable tomorrow, so there is no loss of value for investments in improving data descriptions.