HORIZONTAL INTEGRATION OF BIG INTELLIGENCE DATA
The Role of Ontology in the Era of Big Data
T. Malyuta, Ph.D., New York City College of Technology, New York, NY
B. Smith, Ph.D., University at Buffalo, Buffalo, NY
R. Rudnicki, CUBRC, Buffalo, NY
2
Big Data Problem
• Wikipedia defines Big Data as “…a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.”
• Gartner defines Big Data with three ‘V’s:
  • Volume
  • Velocity (of production and analysis)
  • Variety
• This means that Big Data are beyond our control (as opposed to complex, big systems with diverse and changing data where the complexity is known)
3
Big Data Solution – Agility
• Dimensions of agility
  • Storage paradigms that accommodate massive volumes of heterogeneous data
  • Data processing paradigms that can deal with the massive volumes of heterogeneous data coming onstream
  • Dynamic data stores that can easily accommodate diverse and a priori unknown data types and semantics
  • Methods and tools that leverage dynamic and diverse content
4
Agile Integration and Interoperability
• Today, the main problem with Big Data is using it
• Utilization of ‘Variety’ (diverse types and semantics) requires data integration and interoperability
• Traditional integration approaches fail
• Agile integration paradigms are needed
5
The Problem of Horizontal Integration of Big Intelligence Data
• HI =def. the ability to exploit multiple data sources as if they were one
• Recognized issues for HI with existing approaches
  • Data silos
  • Lexicon/semantics silos
• Requirement for HI of Big Intelligence Data: Agile Semantic Interoperability
  • A strategy for HI must be agile in the sense that it can be quickly extended to new zones of emerging data according to need
  • Ontology allows an incremental approach, with bang for the buck from the very first investment (as we showed in I2WD)
  • Ontology can provide the needed agility
6
Agile Semantic Interoperability
• A good solution has to be
  • Able to grow incrementally
  • Able to be developed in a distributed manner
  • Able to do both without losing consistency
  • Independent of particular implementations, and of data producers and consumers
  • Applicable to data in an agile manner
• We call our solution ‘semantic enhancement’ (SE) of data
7
SE
• SE is realized with the help of ontologies that are used to annotate (tag) data
• The vocabulary of the ontologies used for annotation provides agile horizontal integration
• Ontologies, by virtue of their nature and organization, provide semantic enhancement of data

PersonID  Name  Description
111       Java  Programming
222       SQL   Database

[Figure: ontology fragment with the nodes SQL, Java, C++, ProgrammingSkill, ComputerSkill, Skill, Education, and TechnicalEducation, used to annotate the table]
8
The Meaning of ‘Enhancement’
• Semantic enhancement/enrichment of data = arm’s-length approach (no change to data): through simple annotation we associate an entire knowledge system with a database field
  • This enables analytics to process data, e.g. about computer skills, “vertically” along the Skill hierarchy, as well as “horizontally” via relations between Skill and Education
  • Further, while the data in the database do not change, their analysis can become richer and richer as our understanding of reality changes
• For this richness to be leveraged by different communities, persons, and applications, it needs to have the properties mentioned above and be constructed in accordance with the principles of SE
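A minimal Python sketch of the arm's-length idea on this slide. The is-a links, the `taughtIn` relation, and the choice to annotate the `Name` field with `ComputerSkill` are assumptions made for illustration; the slides name the terms but do not spell out these links.

```python
# Toy ontology fragment. The is-a links and the "taughtIn" relation are
# assumptions for illustration, not taken from the SE ontologies.
IS_A = {
    "ProgrammingSkill": "ComputerSkill",
    "ComputerSkill": "Skill",
    "TechnicalEducation": "Education",
}
RELATIONS = {("ComputerSkill", "taughtIn"): "TechnicalEducation"}

# The annotation lives outside the database: field name -> ontology term.
ANNOTATIONS = {"Name": "ComputerSkill"}

# Source rows, never modified by the enhancement.
ROWS = [
    {"PersonID": 111, "Name": "Java", "Description": "Programming"},
    {"PersonID": 222, "Name": "SQL", "Description": "Database"},
]


def ancestors(term):
    """Vertical navigation: the term followed by its is-a ancestors."""
    chain = [term]
    while chain[-1] in IS_A:
        chain.append(IS_A[chain[-1]])
    return chain


def related(term, relation):
    """Horizontal navigation: follow a relation declared on the term or an ancestor."""
    for t in ancestors(term):
        if (t, relation) in RELATIONS:
            return RELATIONS[(t, relation)]
    return None


def values_of_kind(kind):
    """Values of every field whose annotation is `kind` or a subtype of it."""
    fields = [f for f, term in ANNOTATIONS.items() if kind in ancestors(term)]
    return [row[f] for row in ROWS for f in fields if f in row]
```

Calling `values_of_kind("Skill")` reaches the `Name` values through the hierarchy even though no field is labeled Skill, and the rows themselves stay untouched.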
9
SE Principles
• Create a Shared Semantic Resource (SSR) of ontologies to be used for annotation
• Establish an agile strategy for building ontologies within this SSR, and apply and extend these ontologies to annotate new source data as they come onstream
  • Strategy pioneered in biomedical and other scientific fields: leaves data as they are, and incrementally tags data sources with terms from a growing, consistent, non-redundant set of ontologies
• Problem: given the immense and growing variety of data sources, the development methodology must be applied by multiple different groups
  • How to manage collaboration?
10
Achieving the Goal
• Methodology of incremental distributed ontology development
• A common ontology architecture incorporating a common, domain-neutral, upper-level ontology (BFO)
• A shared governance and change management process
• A simple, repeatable process for ontology development
• An ontology registry
• A process of intelligence data capture through ‘annotation’ or ‘tagging’ of source data artifacts
11
Main Methodological Points
• Ontological realism
  • Based on Doctrine
  • Involves SMEs in label selection and definition
  • Thoroughly tested*
• Arm’s-length process, with minimal disturbance to existing data and data semantics
• Reference ontologies capture generic content and are designed for aggressive reuse in multiple different types of context
  • Single reference ontology for each domain of interest
• Application ontologies are tied to specific local applications
  • An application ontology is created by combining local content with generic content taken over from relevant reference ontologies
  • Application ontologies remain interoperable because they are based on the common set of reference ontologies
* Barry Smith and Werner Ceusters, “Ontological Realism as a Methodology for Coordinated Evolution of Scientific Ontologies”, Applied Ontology, 5 (2010), 139–188.
12
Arm’s-length Process
• Focus on the terms (labels, acronyms, codes) used in our source data
• Where multiple distinct terms {t1, …, tn} are used in separate data sources with one and the same meaning, they are associated with a single preferred label drawn from a standard set of such labels
• All the separate data items associated with {t1, …, tn} are thereby linked together through the corresponding preferred labels
• Preferred labels form the basis for the ontologies we build
[Figure: SE ontology labels linked to heterogeneous contents (ABC, KLM, XYZ)]
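The preferred-label step can be sketched in a few lines of Python. The source names, native terms, and values below are hypothetical; only the technique (many native terms mapped to one preferred label, with data items linked through that label) follows the slide.

```python
from collections import defaultdict

# Hypothetical native terms from three sources that share one meaning.
PREFERRED_LABEL = {
    ("SourceA", "veh"): "Vehicle",
    ("SourceB", "VehicleType"): "Vehicle",
    ("SourceC", "VEH_CD"): "Vehicle",
}

# Hypothetical data items: (source, native term, value).
ITEMS = [
    ("SourceA", "veh", "tractor"),
    ("SourceB", "VehicleType", "crane"),
    ("SourceC", "VEH_CD", "T-01"),
    ("SourceC", "note", "free text"),  # not annotated, so left unlinked
]


def link_by_preferred_label(items, mapping):
    """Group data items from different sources under their preferred label."""
    linked = defaultdict(list)
    for source, term, value in items:
        label = mapping.get((source, term))
        if label is not None:
            linked[label].append((source, value))
    return dict(linked)
```

The items themselves are not rewritten; the mapping is the only thing that grows as new sources and terms arrive.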
13
Reference and Application Ontologies

Reference Ontology
vehicle =def. an object used for transporting people or goods
tractor =def. a vehicle that is used for towing
crane =def. a vehicle that is used for lifting and moving heavy objects
vehicle platform =def. means of providing mobility to a vehicle
wheeled platform =def. a vehicle platform that provides mobility through the use of wheels
tracked platform =def. a vehicle platform that provides mobility through the use of continuous tracks

Application Definitions
artillery vehicle =def. a vehicle designed for the transport of one or more artillery weapons
wheeled tractor =def. a tractor that has a wheeled platform
tracked tractor =def. a tractor that has a tracked platform
artillery tractor =def. an artillery vehicle that is a tractor
wheeled artillery tractor =def. an artillery tractor that has a wheeled platform
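As a rough illustration of how application definitions are composed from reference terms, the definitions on this slide can be encoded as parent links plus an optional platform. The encoding, the split into `REFERENCE` and `APPLICATION`, and the function names are assumptions of this sketch, not part of the SE tooling.

```python
# term -> (parent terms, platform or None). The grouping follows one reading
# of the slide: reference terms first, application definitions second.
REFERENCE = {
    "vehicle": ((), None),
    "tractor": (("vehicle",), None),
    "crane": (("vehicle",), None),
    "vehicle platform": ((), None),
    "wheeled platform": (("vehicle platform",), None),
    "tracked platform": (("vehicle platform",), None),
}
APPLICATION = {
    "artillery vehicle": (("vehicle",), None),
    "wheeled tractor": (("tractor",), "wheeled platform"),
    "tracked tractor": (("tractor",), "tracked platform"),
    "artillery tractor": (("artillery vehicle", "tractor"), None),
    "wheeled artillery tractor": (("artillery tractor",), "wheeled platform"),
}
ONTOLOGY = {**REFERENCE, **APPLICATION}


def superclasses(term):
    """All direct and inherited parents of `term`."""
    found = set()
    stack = list(ONTOLOGY[term][0])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(ONTOLOGY[parent][0])
    return found


def undefined_terms():
    """Terms used inside a definition but defined nowhere (should be empty)."""
    used = set()
    for parents, platform in ONTOLOGY.values():
        used.update(parents)
        if platform is not None:
            used.add(platform)
    return used - set(ONTOLOGY)
```

Because every application term bottoms out in reference terms, `superclasses("wheeled artillery tractor")` includes both `tractor` and `vehicle`, which is what keeps separately built application ontologies interoperable.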
14
Illustration of Ontology Types (Toy Example)
Vehicle
Tractor
Wheeled Tractor
Artillery Tractor
Wheeled Artillery Tractor
Artillery Vehicle
Black – reference ontologies
Red – application ontologies
15
Role of Reference Ontologies
• Normalized
  • Maintains a set of consistent ontologies
  • Eliminates redundancy
• Modular
  • A set of plug-and-play ontology modules
  • Enables distributed consistent development
• Surveyable
16
SE Architecture
• The Upper Level Ontology (ULO) in the SE hierarchy must be maximally general (no overlap with domain ontologies)
• The Mid-Level Ontologies (MLOs) introduce successively less general and more detailed representations of types which arise in successively narrower domains until we reach the Lowest Level Ontologies (LLOs)
• The LLOs are maximally specific representations of the entities in a particular one-dimensional domain
17
Architecture Illustration
18
Current State
• Completed
  • Data Representation and Integration Framework (DRIF): architectural solution and implementation to create the Dataspace (cloud of intelligence data)
    • Lossless representation of sources with their native semantics
  • Semantic Enhancement (SE): suite of prototype ontologies with coverage allowing annotation of these native semantics
  • Index exposing the content of the Dataspace via SE, with proven benefits
  • Methodology and architecture for ontology development
• In progress
  • Assembling the Shared Semantic Resource (SSR) as a separate store and enabling its use outside the Dataspace; in discussions with various agencies
19
The SSR
[Figure: Reference Ontologies (Shared Semantic Resource), including Agent, Organization, Weapon, Information Artifact, Geospatial, Event, …; Application Ontologies (Agent-related, Weapon-related, …) that use them; DoD, Air Force, Navy, and NSA, for purposes of Reporting, Video Analysis, NLP, and Intelligence Analysis]
20
Challenges to HI
• Too many lexicons
• The scope of the domain: signal, sensor, image, … intelligence about … the whole world
• Difficult to conduct governance and management of ontology development to ensure consistent evolution
• Lack of expertise
21
Preventing Failure
• The method we use offers solutions to some of the common reasons for failure
• Lack of consensus
  • Realism offers an objective standard for settling disputes over terminology. Ontology development becomes an empirical science instead of an exercise in the publication of dialects
  • Governance helps to resolve conflicts and achieve consensus
• High maintenance
  • Arm’s-length implementation places no additional overhead on applications
• Parochialism
  • Architecture and methodology prevent development of vocabularies that apply only to a single perspective
• Poor quality
  • Experience prevents common mistakes in vocabularies that cause downstream problems with search and analytics
Distributed Common Ground System – Army (DCGS-A)
Semantic Enhancement of the Dataspace on the Cloud
23
Integrated Store of Intelligence Data
• Lossless integration without heavy pre-processing
• Ability to:
  • Incorporate multiple integration models / approaches / points of view of data and data-semantics
  • Perform continuous semantic enrichment of the integrated store
• Scalability
24
Solution Components
• Cloud implementation
  • Cloudbase (Accumulo)
• Data Representation and Integration Framework
  • Comprehensive unified representation of data, data semantics, and metadata
• This work was funded by US Army CERDEC Intelligence and Information Warfare Directorate (I2WD)
25
Dealing with Semantic Heterogeneity
Physical integration. A separate data store homogenizing semantics in a particular data-model. It works only for special cases, entails loss and distortion of data and semantics, and creates a new data silo.
Virtual integration. A projection onto a homogeneous data-model exposed to users. It is more flexible, but may have the problem of data availability (e.g. military, intelligence). Also, a particular homogeneous model has limited usage, does not expose all content, and does not support enrichment.
26
Pursuit of the Holy Grail of Intelligence Data Integration
• In a highly dynamic semantic environment evolving in ad hoc ways
  • How to have it all and have it available immediately and at any time?
    • Traditional physical and virtual integration approaches fail to respond to these requirements
  • How to use these data resources efficiently (integrate, query, and analyze)?
27
Workable Solution
A physical store incorporating heterogeneous contents. The Data Representation and Integration Framework (DRIF) is based on a decomposed representation of structured data (RDF-style). It allows collection of data resources without loss or distortion and thereby achieves representational integration.
Lightweight Semantic Enhancement (SE) supports semantic integration and provides a decent utilization capability without adding storage and processing weight to the already storage- and processing-heavy Dataspace.
28
DRIF Dataspace
• Integration without heavy pre-processing (ad hoc rapid integration):
  • Of any data artifact regardless of the model (or absence of it) and modality
  • Without loss or distortion of data and data-semantics
• Continuous evolution and enrichment
• Pay-as-you-go solution
  • While data and data-semantics are expected to be enriched and refined, they can be efficiently utilized immediately after entering the Dataspace through querying, navigation, and drilling
Organization of the DRIF Dataspace
[Figure: Registration, Ingestion, Extraction [Transformation] / Enrichment]
30
Semantic Enhancement of the Dataspace
• Simple yet efficient harmonization strategy
  • Takes place not by changing the data semantics to which it is applied, but rather by adding an extra semantic layer on top of them
  • Long-lasting solution that can be applied consistently and in cumulative fashion to new models entering the Dataspace
• Strategy compliant with and complementing the DRIF
  • Source data models are not changed
  • Can be used efficiently, and in a unified fashion, in search, reasoning, and analytics
  • Provides views of the Dataspace at different levels of detail
• Mapping to a particular Über-model or choosing a single comprehensive model for harmonization does not provide the benefits described
31
Illustration
• The DRIF Dataspace accommodates many data models and is a microcosm of a collection of systems with diverse and heterogeneous data
• Incremental annotations of these data models through SE ontologies
  • Preserving the native content of data resources
  • Presenting the native content via the SE annotations
• Benefits of the approach
32
Sources
• Source database Db1, with tables Person and Skill, containing person data and data pertaining to skills of different kinds, respectively:

  PersonID  SkillID
  111       222

  SkillID  Name  Description
  222      Java  Programming

• Source database Db2, with the table Person, containing data about IT personnel and their skills:

  ID   SkillDescr
  333  SQL

• Source database Db3, with the table ProgrSkill, containing data about programmers’ skills:

  EmplID  SkillName
  444     Java
33
Representation in the Dataspace

Native representation of structured data

Value and Associated Label  Relation        Value and Associated Label
111, Db1.PersonID           hasSkillID      222, Db1.SkillID
222, Db1.SkillID            hasName         Java, Db1.Name
222, Db1.SkillID            hasDescription  Programming, Db1.Description
333, Db2.ID                 hasSkillDescr   SQL, Db2.SkillDescr
444, Db3.EmplID             hasSkillName    Java, Db3.SkillName

Representation of data-models, SE and SE annotations as Concepts and ConceptAssociations
(Blue – SE annotations; Red – SE hierarchies)

Label                Relation  SE Label
Db1.Name             Is-a      SE.Skill
Db2.SkillDescr       Is-a      SE.ComputerSkill
Db3.SkillName        Is-a      SE.ProgrammingSkill
Db1.PersonID         Is-a      SE.PersonID
Db2.ID               Is-a      SE.PersonID
Db3.EmplID           Is-a      SE.PersonID
SE.ComputerSkill     Is-a      SE.Skill
SE.ProgrammingSkill  Is-a      SE.ComputerSkill
34
Indexed Contents Based on the SE
Index entries based on the SE and native (blue) vocabularies

Index Entry    Associated Field-Value
111, PersonID  Type: Person
               Skill: Java
               Db1.Description: Programming
333, PersonID  Type: Person
               ComputerSkill: SQL
444, PersonID  Type: Person
               ProgrammingSkill: Java
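One way such an index could be derived from the decomposed representation and the SE annotations is sketched below in Python. The data are the toy values from these slides; the traversal logic, the function name `build_index`, and the rule that a `PersonID` annotation yields `Type: Person` are assumptions of the sketch, not a description of the actual DRIF implementation.

```python
from collections import defaultdict

# Decomposed (RDF-style) statements: ((value, label), relation, (value, label)).
TRIPLES = [
    ((111, "Db1.PersonID"), "hasSkillID", (222, "Db1.SkillID")),
    ((222, "Db1.SkillID"), "hasName", ("Java", "Db1.Name")),
    ((222, "Db1.SkillID"), "hasDescription", ("Programming", "Db1.Description")),
    ((333, "Db2.ID"), "hasSkillDescr", ("SQL", "Db2.SkillDescr")),
    ((444, "Db3.EmplID"), "hasSkillName", ("Java", "Db3.SkillName")),
]

# SE annotations: native label -> SE label.
ANNOTATIONS = {
    "Db1.Name": "Skill",
    "Db2.SkillDescr": "ComputerSkill",
    "Db3.SkillName": "ProgrammingSkill",
    "Db1.PersonID": "PersonID",
    "Db2.ID": "PersonID",
    "Db3.EmplID": "PersonID",
}


def build_index(triples, annotations):
    """Build one index entry per person, preferring SE labels over native ones."""
    by_subject = defaultdict(list)
    for subject, _relation, obj in triples:
        by_subject[subject].append(obj)

    index = {}
    for subject in by_subject:
        if annotations.get(subject[1]) != "PersonID":
            continue  # only person identifiers become index entries
        entry = {"Type": "Person"}
        stack = list(by_subject[subject])
        while stack:
            value, label = stack.pop()
            if (value, label) in by_subject:
                # Intermediate key (e.g. a SkillID): follow it to its fields.
                stack.extend(by_subject[(value, label)])
            else:
                # Unannotated fields keep their native label.
                entry[annotations.get(label, label)] = value
        index[subject[0]] = entry
    return index
```

Fields without an annotation, such as `Db1.Description`, stay in the index under their native label, which matches the mixed SE and native vocabulary shown in the table.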
35
Benefits of DRIF + SE
• Leverages syntactic integration provided by DRIF, semantic integration provided by the SE vocabulary and annotations of native sources, and rich semantics provided by ontologies in general
  • Entering Skill = Java (which will be rewritten at run time as: Skill = Java OR ComputerSkill = Java OR ProgrammingSkill = Java OR NetworkSkill = Java) will return: persons 111 and 444
  • Entering ComputerSkill = Java OR ComputerSkill = SQL will return: persons 333 and 444
  • Entering ProgrammingSkill = Java will return: person 444
  • Entering Description = Programming will return: person 111
• Allows querying, searching, and manipulating native representations
• Lightweight, non-intrusive approach that can be improved and refined without impacting the Dataspace
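The run-time query rewriting can be sketched as follows in Python. The index contents are the toy entries from the SE-based index; placing `NetworkSkill` under `ComputerSkill` is an assumption (the slides list it in the rewritten query but do not show where it sits in the hierarchy), and the function names are illustrative only.

```python
# SE hierarchy as parent -> direct subtypes. NetworkSkill's position is assumed.
SUBTYPES = {
    "Skill": ["ComputerSkill"],
    "ComputerSkill": ["ProgrammingSkill", "NetworkSkill"],
}

# Toy index keyed by person identifier.
INDEX = {
    111: {"Type": "Person", "Skill": "Java", "Db1.Description": "Programming"},
    333: {"Type": "Person", "ComputerSkill": "SQL"},
    444: {"Type": "Person", "ProgrammingSkill": "Java"},
}


def expand(field):
    """Return `field` followed by all of its direct and indirect subtypes."""
    fields = [field]
    for child in SUBTYPES.get(field, []):
        fields.extend(expand(child))
    return fields


def search(field, value):
    """Rewrite `field = value` as an OR over the subtypes and run it on the index."""
    fields = expand(field)
    return sorted(
        person
        for person, entry in INDEX.items()
        if any(entry.get(f) == value for f in fields)
    )
```

With these data, `search("Skill", "Java")` finds persons 111 and 444, while the native field is queried here by its full label, `search("Db1.Description", "Programming")`.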
36
Index Contents without the SE
Index entries based on native vocabularies

Index Entry    Associated Field-Value
111, PersonID  Type: Person
               Name: Java
               Description: Programming
333, ID        Type: Person
               SkillDescr: SQL
444, EmplID    Type: Person
               SkillName: Java
37
Problems
• Even for our toy example we can see how much manual effort the analyst needs to apply in performing a search without SE, and even then the information gained will be meager in comparison with what is made available through the Index with SE.
• For example, if an analyst is familiar with the labels used in Db1 and is thus in a position to enter Name = Java, the query will still return only person 111. Directly salient Db3 information (person 444) will thus be missed.
38
Additional Notes on the SE Process
• Original data and data-semantics are included in the Dataspace without loss or distortion; thus there is no need to cover all semantics of the Dataspace. What is unlikely to be used in search or is not important for integration will still be available when needed
• A complex ontology is not needed – a common and shared vocabulary is sufficient for virtual semantic integration and search/analytics
• The approach is very flexible, and investments can be made in specific areas according to need (pay-as-you-go)
• The approach is tunable – if the chosen annotations of a particular subset of a source data-model are too general for data analyses, the respective ontologies can be further developed and source models re-annotated
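The tunability point can be illustrated with a small Python sketch: the same source items are searched under a coarse annotation and then under a refined one. The field `DbX.Skill` and its re-annotation are hypothetical, introduced only to show that refinement changes search results without touching the data.

```python
IS_A = {"ProgrammingSkill": "ComputerSkill", "ComputerSkill": "Skill"}

# Source items (person id, native label, value); never modified.
# DbX.Skill is a hypothetical field used only for this illustration.
SOURCE = (
    (111, "DbX.Skill", "Java"),
    (444, "Db3.SkillName", "Java"),
)


def is_kind_of(term, kind):
    """True if `term` equals `kind` or has it as an is-a ancestor."""
    while term is not None:
        if term == kind:
            return True
        term = IS_A.get(term)
    return False


def search(kind, value, annotations):
    """Persons with a field of the given kind (or a subtype) holding `value`."""
    return sorted(
        person
        for person, label, v in SOURCE
        if v == value and is_kind_of(annotations.get(label), kind)
    )


# First pass: DbX.Skill is annotated with the general term Skill.
coarse = {"DbX.Skill": "Skill", "Db3.SkillName": "ProgrammingSkill"}
# Later pass: the source model is re-annotated with a more specific term.
refined = {**coarse, "DbX.Skill": "ProgrammingSkill"}
```

Under `coarse`, a search for `ComputerSkill = Java` misses person 111 because `Skill` is too general; under `refined` the same search returns both persons, with `SOURCE` unchanged.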
39
Benefits of the Approach
• Does not interfere with the source content
• Enhancement enables this content to evolve in a cumulative fashion as it accommodates new kinds of data
• Does not depend on the data resources and can be developed independently from them in an incremental and distributed fashion
• Provides a more consistent, homogeneous, and well-articulated presentation of the content, which originates in multiple internally inconsistent and heterogeneous systems
• Makes management and exploitation of the content more cost-effective
• The use of the selected ontologies brings integration with other government initiatives and brings the system closer to the federally mandated net-centric data strategy
• Creates an integrated content that is effectively searchable and that provides content to which more powerful analytics can be applied
40
Towards Globalization and Sharing
• Using the SE approach to create a Shared Semantic Resource for the Intelligence Community to enable interoperability across systems
• Applying it directly to, or projecting its contents on, a particular integration solution
41
References
• Smith B. et al., “Horizontal Integration of Warfighter Intelligence Data: A Shared Semantic Resource for the Intelligence Community”, STIDS Conference, 2012.
• Smith B. et al., “Ontology for the Intelligence Analyst”, Crosstalk: The Journal of Defense Software Engineering, 2012.
• Salmen D. et al., “Integration of Intelligence Data through Semantic Enhancement”, STIDS Conference, 2011.
Data Tactics Corporation
7901 Jones Branch Dr.
Suite 700
McLean, VA 22102
www.data-tactics-corp.com