Horizontal Integration of Big Intelligence Data


The Role of Ontology in the Era of Big Data

T. Malyuta, Ph.D., New York City College of Technology, New York, NY

B. Smith, Ph.D., University at Buffalo, Buffalo, NY

R. Rudnicki, CUBRC, Buffalo, NY

Big Data Problem

• Wikipedia defines Big Data as “…a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.”

• Gartner defines Big Data with three ‘V’s:
  • Volume
  • Velocity (of production and analysis)
  • Variety

• This means that Big Data are beyond our control (as opposed to those complex and big systems with diverse and changing data where the complexity is known)

Big Data Solution – Agility

• Dimensions of agility
  • Storage paradigms that accommodate massive volumes of heterogeneous data
  • Data processing paradigms that can deal with the massive volumes of heterogeneous data coming onstream
  • Dynamic data stores that can easily accommodate diverse and a priori unknown data types and semantics
  • Methods and tools that leverage dynamic and diverse content

Agile Integration and Interoperability

• Today, the main problem with Big Data is using it

• Utilization of ‘Variety’ (diverse types and semantics) requires data integration and interoperability

• Traditional integration approaches fail

• Agile integration paradigms are needed

The Problem of Horizontal Integration of Big Intelligence Data

• HI =def. the ability to exploit multiple data sources as if they were one

• Recognized issues for HI with existing approaches
  • Data silos
  • Lexicon/semantics silos

• Requirement for HI of Big Intelligence Data – Agile Semantic Interoperability
  • A strategy for HI must be agile in the sense that it can be quickly extended to new zones of emerging data according to need
  • Ontology allows an incremental approach – big bang already from the very first buck (as we showed in I2WD)
  • Ontology can provide the needed agility

Agile Semantic Interoperability

• A good solution has to be
  • Able to grow incrementally
  • Able to be developed in a distributed manner
  • Without losing consistency
  • Independent of particular implementations, and of data producers and consumers
  • Applicable to data in an agile manner

• We call our solution: ‘semantic enhancement’ (SE) of data

SE

• SE is realized with the help of ontologies that are used to annotate (tag) data

• The vocabulary of the ontologies used for annotations provides agile horizontal integration

• Ontologies, by virtue of their nature and organization, provide semantic enhancement of data

PersonID  Name  Description
111       Java  Programming
222       SQL   Database

[Figure: ontology fragment with the nodes Skill, ComputerSkill, ProgrammingSkill, SQL, Java, C++, Education, and TechnicalEducation, used to annotate the table above]

The Meaning of ‘Enhancement’

• Semantic enhancement/enrichment of data = arm’s length approach (no change to data): through simple annotation we associate an entire knowledge system with a database field
  • This enables analytics to process data, e.g. about computer skills, “vertically” along the Skill hierarchy, as well as “horizontally” via relations between Skill and Education
  • Further, while data in the database does not change, its analysis can become richer and richer as our understanding of reality changes

• For this richness to be leveraged by different communities, persons, and applications, it needs to have the properties mentioned above and be constructed in accordance with the principles of SE

SE Principles

⁻ Create a Shared Semantic Resource (SSR) of ontologies to be used for annotation

⁻ Establish an agile strategy for building ontologies within this SSR, and apply and extend these ontologies to annotate new source data as they come onstream
  ⁻ Strategy pioneered in biomedical and other scientific fields: leaves data as they are, and incrementally tags data sources with terms from a growing, consistent, non-redundant set of ontologies

⁻ Problem: Given the immense and growing variety of data sources, the development methodology must be applied by multiple different groups
  ⁻ How to manage collaboration?

Achieving the Goal

• Methodology of incremental distributed ontology development

• A common ontology architecture incorporating a common, domain-neutral, upper-level ontology (BFO)

• A shared governance and change management process

• A simple, repeatable process for ontology development

• An ontology registry

• A process of intelligence data capture through ‘annotation’ or ‘tagging’ of source data artifacts

Main Methodological Points

• Ontological realism
  • Based on Doctrine
  • Involves SMEs in label selection and definition
  • Thoroughly tested*

• Arms-length process, with minimal disturbance to existing data and data semantics

• Reference ontologies – capture generic content and are designed for aggressive reuse in multiple different types of context
  • Single reference ontology for each domain of interest

• Application ontologies – are tied to specific local applications
  • An application ontology is created by combining local content with generic content taken over from relevant reference ontologies
  • Application ontologies remain interoperable because they are based on the common set of reference ontologies

* Barry Smith and Werner Ceusters, “Ontological Realism as a Methodology for Coordinated Evolution of Scientific Ontologies”, Applied Ontology, 5 (2010), 139–188.

Arms-length Process

• Focusing on the terms (labels, acronyms, codes) used in our source data.

• Where multiple distinct terms {t1, …, tn} are used in separate data sources with one and the same meaning, they are associated with a single preferred label drawn from a standard set of such labels

• All the separate data items associated with the {t1, …, tn} are thereby linked together through the corresponding preferred labels.

• Preferred labels form the basis for the ontologies we build

[Figure: heterogeneous contents ABC, KLM, and XYZ linked to SE ontology labels]
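The preferred-label step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `PREFERRED_LABEL`, the list `DATA_ITEMS`, and the function `link_by_preferred_label` are hypothetical names, and the sample terms are borrowed from the toy databases used later in this deck.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific terms {t1, ..., tn}
# to a single preferred label.
PREFERRED_LABEL = {
    "Db1.PersonID": "PersonID",
    "Db2.ID": "PersonID",
    "Db3.EmplID": "PersonID",
}

# Hypothetical data items, each tagged with the term used in its own source.
DATA_ITEMS = [
    ("Db1.PersonID", "111"),
    ("Db2.ID", "333"),
    ("Db3.EmplID", "444"),
    ("Db1.Description", "Programming"),  # no preferred label assigned yet
]


def link_by_preferred_label(items, preferred):
    """Group data items under the preferred label of their source term.

    The items themselves are left unchanged (arms-length); terms
    without a preferred label are simply skipped.
    """
    linked = defaultdict(list)
    for term, value in items:
        label = preferred.get(term)
        if label is not None:
            linked[label].append((term, value))
    return dict(linked)


linked = link_by_preferred_label(DATA_ITEMS, PREFERRED_LABEL)
print(linked["PersonID"])
```

Because the mapping lives outside the sources, a new source term can be linked by adding one dictionary entry, without touching the data.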

Reference and Application Ontologies

Reference Ontology

vehicle =def. an object used for transporting people or goods

tractor =def. a vehicle that is used for towing

crane =def. a vehicle that is used for lifting and moving heavy objects

vehicle platform =def. means of providing mobility to a vehicle

wheeled platform =def. a vehicle platform that provides mobility through the use of wheels

tracked platform =def. a vehicle platform that provides mobility through the use of continuous tracks

Application Definitions

artillery vehicle =def. vehicle designed for the transport of one or more artillery weapons

wheeled tractor =def. a tractor that has a wheeled platform

tracked tractor =def. a tractor that has a tracked platform

artillery tractor =def. an artillery vehicle that is a tractor

wheeled artillery tractor =def. an artillery tractor that has a wheeled platform
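As a rough sketch of how such definitions compose, the is-a part of each definition above can be written down as a parent table and closed transitively. The names `IS_A` and `ancestors` are hypothetical, and the "has a wheeled/tracked platform" clauses are deliberately not modeled here, so this is not a full reasoner.

```python
# Direct is-a parents read off the definitions above.
# The platform clauses ("has a wheeled platform") are not modeled.
IS_A = {
    "vehicle": [],
    "tractor": ["vehicle"],
    "crane": ["vehicle"],
    "artillery vehicle": ["vehicle"],
    "wheeled tractor": ["tractor"],
    "tracked tractor": ["tractor"],
    "artillery tractor": ["artillery vehicle", "tractor"],
    "wheeled artillery tractor": ["artillery tractor"],
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is-a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


print(sorted(ancestors("wheeled artillery tractor", IS_A)))
```

Data tagged with the most specific term is therefore also retrievable under every more general term, which is what keeps locally defined terms interoperable with the shared ones.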

Illustration of Ontology Types (Toy Example)

[Figure: is-a hierarchy over Vehicle, Tractor, Artillery Vehicle, Wheeled Tractor, Artillery Tractor, and Wheeled Artillery Tractor. Black – reference ontologies; red – application ontologies.]

Role of Reference Ontologies

• Normalized
  • Maintains a set of consistent ontologies
  • Eliminates redundancy

• Modular
  • A set of plug-and-play ontology modules
  • Enables distributed consistent development

• Surveyable

SE Architecture

• The Upper Level Ontology (ULO) in the SE hierarchy must be maximally general (no overlap with domain ontologies)

• The Mid-Level Ontologies (MLOs) introduce successively less general and more detailed representations of types which arise in successively narrower domains until we reach the Lowest Level Ontologies (LLOs).

• The LLOs are maximally specific representations of the entities in a particular one-dimensional domain


Architecture Illustration

Current State

• Completed
  • Data Representation and Integration Framework (DRIF): architectural solution and implementation to create Dataspace (cloud of intelligence data)
  • Lossless representation of sources with their native semantics
  • Semantic Enhancement (SE): suite of prototype ontologies with coverage allowing annotation of these native semantics
  • Index exposing the content of the Dataspace via SE with proven benefits
  • Methodology and architecture for ontology development

• In progress
  • Assembling the Shared Semantic Resource (SSR) as a separate store and enabling its use outside the Dataspace; in discussions with various agencies

The SSR

[Figure: the SSR. Labels: Reference Ontologies (Shared Semantic Resource): Agent, Organization, Weapon, Event, Geospatial, Information Artifact, …; Application Ontologies: Agent-related, Weapon-related, …; users: DoD, AirForce, Navy, NSA; purposes of use: Reporting, Video Analysis, NLP, Intelligence Analysis]

Challenges to HI

• Too many lexicons

• The scope of the domain: signal, sensor, image, … intelligence about … the whole world

• Difficult to conduct governance and management of ontology development to ensure consistent evolution

• Lack of expertise

Preventing Failure

• The method we use offers solutions to some of the common reasons for failure

• Lack of Consensus
  • Realism offers an objective standard for settling disputes over terminology. Ontology development becomes an empirical science instead of an exercise in the publication of dialects
  • Governance helps to resolve conflicts and achieve consensus

• High Maintenance
  • Arm’s length implementation places no additional overhead onto applications

• Parochialism
  • Architecture and methodology prevent development of vocabularies that apply only to a single perspective

• Poor Quality
  • Experience prevents common mistakes in vocabularies that cause downstream problems with search and analytics

Distributed Common Ground System – Army (DCGS-A)

Semantic Enhancement of the Dataspace on the Cloud

Integrated Store of Intelligence Data

• Lossless integration without heavy pre-processing

• Ability to:
  • Incorporate multiple integration models / approaches / points of view of data and data-semantics
  • Perform continuous semantic enrichment of the integrated store

• Scalability

Solution Components

• Cloud implementation
  • Cloudbase (Accumulo)

• Data Representation and Integration Framework
  • Comprehensive unified representation of data, data semantics, and metadata

• This work was funded by US Army CERDEC Intelligence and Information Warfare Directorate (I2WD)

Dealing with Semantic Heterogeneity

Physical integration. A separate data store homogenizing semantics in a particular data-model – works only for special cases, entails loss and distortion of data and semantics, creates a new data silo.

Virtual integration. A projection onto a homogeneous data-model exposed to users – is more flexible, but may have the problem of data availability (e.g. military, intelligence). Also, a particular homogeneous model has limited usage, does not expose all content, and does not support enrichment.

Pursuit of the Holy Grail of Intelligence Data Integration

• In a highly dynamic semantic environment evolving in ad hoc ways
  • how to have it all and have it available immediately and at any time?
  • Traditional physical and virtual integration approaches fail to respond to these requirements
  • how to use these data resources efficiently (integrate, query, and analyze)?

Workable Solution

A physical store incorporating heterogeneous contents. The Data Representation and Integration Framework (DRIF) is based on a decomposed representation of structured data (RDF-style) and allows collection of data resources without loss or distortion, thereby achieving representational integration.

Lightweight Semantic Enhancement (SE) supports semantic integration and provides a decent utilization capability without adding storage and processing weight to the already storage- and processing-heavy Dataspace.

DRIF Dataspace

• Integration without heavy pre-processing (ad hoc rapid integration):
  • Of any data artifact regardless of the model (or absence of it) and modality
  • Without loss or distortion of data and data-semantics

• Continuous evolution and enrichment

• Pay-as-you-go solution
  • While data and data-semantics are expected to be enriched and refined, they can be efficiently utilized immediately after entering the Dataspace through querying, navigation, and drilling

Organization of the DRIF Dataspace

[Figure: Registration, Ingestion, Extraction [Transformation] / Enrichment]

Semantic Enhancement of the Dataspace

• Simple yet efficient harmonization strategy
  • Takes place not by changing the data semantics to which it is applied, but rather by adding an extra semantic layer to it
  • Long-lasting solution that can be applied consistently and in cumulative fashion to new models entering the Dataspace

• Strategy compliant with and complementing the DRIF
  • Source data models are not changed
  • Can be used efficiently, and in a unified fashion, in search, reasoning, and analytics
  • Provides views of the Dataspace at different levels of detail

• Mapping to a particular Über-model or choosing a single comprehensive model for harmonization does not provide the benefits described

Illustration

• DRIF Dataspace accommodates lots of data models and is a microcosm of a collection of systems with diverse and heterogeneous data

• Incremental annotations of these data models through SE ontologies
  • Preserving the native content of data resources
  • Presenting the native content via the SE annotations

• Benefits of the approach

Sources

• Source database Db1, with tables Person and Skill, containing person data and data pertaining to skills of different kinds, respectively:

  Person
  PersonID  SkillID
  111       222

  Skill
  SkillID  Name  Description
  222      Java  Programming

• Source database Db2, with the table Person, containing data about IT personnel and their skills:

  ID   SkillDescr
  333  SQL

• Source database Db3, with the table ProgrSkill, containing data about programmers’ skills:

  EmplID  SkillName
  444     Java

Representation in the Dataspace

Native representation of structured data

Value and Associated Label  Relation        Value and Associated Label
111, Db1.PersonID           hasSkillID      222, Db1.SkillID
222, Db1.SkillID            hasName         Java, Db1.Name
222, Db1.SkillID            hasDescription  Programming, Db1.Description
333, Db2.ID                 hasSkillDescr   SQL, Db2.SkillDescr
444, Db3.EmplID             hasSkillName    Java, Db3.SkillName

Representation of data-models, SE and SE annotations as Concepts and ConceptAssociations

Label                Relation  SE Label
Db1.Name             Is-a      SE.Skill
Db2.SkillDescr       Is-a      SE.ComputerSkill
Db3.SkillName        Is-a      SE.ProgrammingSkill
Db1.PersonID         Is-a      SE.PersonID
Db2.ID               Is-a      SE.PersonID
Db3.EmplID           Is-a      SE.PersonID
SE.ComputerSkill     Is-a      SE.Skill
SE.ProgrammingSkill  Is-a      SE.ComputerSkill

Blue – SE annotations; red – SE hierarchies
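The decomposed representation can be illustrated with a short Python sketch. It is an assumption-laden toy, not the DRIF implementation: the function names `decompose` and `se_chain`, the row dictionaries, and the in-memory lists are all hypothetical; only the values, labels, and relation names are taken from the tables above.

```python
# Source rows, transcribed from the three toy databases.
DB1_PERSON = [{"PersonID": "111", "SkillID": "222"}]
DB1_SKILL = [{"SkillID": "222", "Name": "Java", "Description": "Programming"}]
DB2_PERSON = [{"ID": "333", "SkillDescr": "SQL"}]
DB3_PROGRSKILL = [{"EmplID": "444", "SkillName": "Java"}]


def decompose(db, rows, key):
    """Turn each row into (subject, relation, object) statements.

    Subject and object are (value, label) pairs; the relation is
    named 'has<Column>', following the table above.
    """
    statements = []
    for row in rows:
        subject = (row[key], f"{db}.{key}")
        for column, value in row.items():
            if column != key:
                statements.append((subject, f"has{column}", (value, f"{db}.{column}")))
    return statements


statements = (
    decompose("Db1", DB1_PERSON, "PersonID")
    + decompose("Db1", DB1_SKILL, "SkillID")
    + decompose("Db2", DB2_PERSON, "ID")
    + decompose("Db3", DB3_PROGRSKILL, "EmplID")
)

# SE annotations and SE hierarchy, kept apart from the native statements.
SE_ASSOCIATIONS = [
    ("Db1.Name", "Is-a", "SE.Skill"),
    ("Db2.SkillDescr", "Is-a", "SE.ComputerSkill"),
    ("Db3.SkillName", "Is-a", "SE.ProgrammingSkill"),
    ("Db1.PersonID", "Is-a", "SE.PersonID"),
    ("Db2.ID", "Is-a", "SE.PersonID"),
    ("Db3.EmplID", "Is-a", "SE.PersonID"),
    ("SE.ComputerSkill", "Is-a", "SE.Skill"),
    ("SE.ProgrammingSkill", "Is-a", "SE.ComputerSkill"),
]


def se_chain(label, associations):
    """Follow Is-a links from a native label up through the SE hierarchy."""
    parent = {child: target for child, _, target in associations}
    chain = []
    while label in parent:
        label = parent[label]
        chain.append(label)
    return chain


for statement in statements:
    print(statement)
print(se_chain("Db3.SkillName", SE_ASSOCIATIONS))
```

Keeping the annotations in a separate list mirrors the point of the slide: the native statements are never rewritten, and the semantic layer can be extended or corrected on its own.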

Indexed Contents Based on the SE

Index entries based on the SE and native (blue) vocabularies

Index Entry    Associated Field-Value
111, PersonID  Type: Person
               Skill: Java
               Db1.Description: Programming
333, PersonID  Type: Person
               ComputerSkill: SQL
444, PersonID  Type: Person
               ProgrammingSkill: Java

Benefits of DRIF + SE

• Leverages syntactic integration provided by DRIF, semantic integration provided by the SE vocabulary and annotations of native sources, and rich semantics provided by ontologies in general
  • Entering Skill = Java (which will be rewritten at run time as: Skill = Java OR ComputerSkill = Java OR ProgrammingSkill = Java OR NetworkSkill = Java) will return: persons 111 and 444
  • Entering ComputerSkill = Java OR ComputerSkill = SQL will return: persons 333 and 444
  • Entering ProgrammingSkill = Java will return: person 444
  • Entering Description = Programming will return: person 111

• Allows querying, searching, and manipulating native representations

• Lightweight, non-intrusive approach that can be improved and refined without impacting the Dataspace
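The run-time query rewriting described above can be sketched as follows. The index contents come from the SE index slide; everything else is an assumption made for illustration: the names `SUBCLASSES`, `INDEX`, `expand`, and `search` are hypothetical, NetworkSkill is placed under ComputerSkill only as a guess, and the native field is queried by its qualified name `Db1.Description`.

```python
# SE hierarchy as parent -> direct children.
# Assumption: NetworkSkill sits under ComputerSkill (not stated in the source).
SUBCLASSES = {
    "Skill": ["ComputerSkill"],
    "ComputerSkill": ["ProgrammingSkill", "NetworkSkill"],
}

# SE-based index: person ID -> (field, value) entries.
INDEX = {
    "111": [("Skill", "Java"), ("Db1.Description", "Programming")],
    "333": [("ComputerSkill", "SQL")],
    "444": [("ProgrammingSkill", "Java")],
}


def expand(field):
    """Return the field followed by all of its descendants in the hierarchy."""
    fields = [field]
    for child in SUBCLASSES.get(field, []):
        fields.extend(expand(child))
    return fields


def search(*conditions):
    """Return sorted person IDs matching any (field, value) condition.

    Each condition is first rewritten into an OR over the field and
    all of its sub-fields, as described in the bullets above.
    """
    rewritten = {(f, value) for field, value in conditions for f in expand(field)}
    return sorted(pid for pid, entries in INDEX.items() if rewritten & set(entries))


print(expand("Skill"))
print(search(("Skill", "Java")))
print(search(("ComputerSkill", "Java"), ("ComputerSkill", "SQL")))
print(search(("ProgrammingSkill", "Java")))
print(search(("Db1.Description", "Programming")))
```

Person 111 is not returned for the ComputerSkill query because Db1.Name was annotated only with the more general SE.Skill; re-annotating that field with a more specific term would change the result without touching the source data.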

Index Contents without the SE

Index entries based on native vocabularies

Index Entry    Associated Field-Value
111, PersonID  Type: Person
               Name: Java
               Description: Programming
333, ID        Type: Person
               SkillDescr: SQL
444, EmplID    Type: Person
               SkillName: Java

Problems

• Even for our toy example we can see how much manual effort the analyst needs to apply in performing search without SE – and even then the information he will gain will be meager in comparison with what is made available through the Index with SE.

• For example, if an analyst is familiar with the labels used in Db1 and is thus in a position to enter Name = Java, his query will still return only: person 111. Directly salient Db3 information will thus be missed.
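For contrast, the same kind of lookup against the native index can be sketched as below. The names `NATIVE_INDEX` and `native_search` are hypothetical; the entries are transcribed from the index without SE.

```python
# Native index: person ID -> (field, value) entries, no SE layer.
NATIVE_INDEX = {
    "111": [("Name", "Java"), ("Description", "Programming")],
    "333": [("SkillDescr", "SQL")],
    "444": [("SkillName", "Java")],
}


def native_search(field, value):
    """Return sorted person IDs whose entries contain exactly (field, value)."""
    return sorted(
        pid for pid, entries in NATIVE_INDEX.items() if (field, value) in entries
    )


# An analyst who only knows the Db1 vocabulary finds one person.
print(native_search("Name", "Java"))

# Finding everyone with Java requires knowing every source's field name.
known_fields = ["Name", "SkillName"]
everyone = sorted({pid for f in known_fields for pid in native_search(f, "Java")})
print(everyone)
```

The second query only works because the field names of both sources were listed by hand, which is the manual effort the slide refers to.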

Additional Notes on the SE process

• Original data and data-semantics are included in the Dataspace without loss or distortion; thus there is no need to cover all semantics of the Dataspace – what is unlikely to be used in search or is not important for integration will still be available when needed

• A complex ontology is not needed – a common and shared vocabulary is sufficient for virtual semantic integration and search/analytics

• The approach is very flexible, and investments can be made in specific areas according to need (pay-as-you-go)

• The approach is tunable – if the chosen annotations of a particular subset of a source data-model are too general for data analyses, the respective ontologies can be further developed and source models re-annotated

Benefits of the Approach

• Does not interfere with the source content

• Enhancement enables this content to evolve in a cumulative fashion as it accommodates new kinds of data

• Does not depend on the data resources and can be developed independently from them in an incremental and distributed fashion

• Provides a more consistent, homogeneous, and well-articulated presentation of the content which originates in multiple internally inconsistent and heterogeneous systems

• Makes management and exploitation of the content more cost-effective

• The use of the selected ontologies brings integration with other government initiatives and brings the system closer to the federally mandated net-centric data strategy

• Creates an integrated content that is effectively searchable and that provides content to which more powerful analytics can be applied

Towards Globalization and Sharing

• Using the SE approach to create a Shared Semantic Resource for the Intelligence Community to enable interoperability across systems

• Applying it directly to or projecting its contents on a particular integration solution

References

• Smith B. et al., “Horizontal Integration of Warfighter Intelligence Data: A Shared Semantic Resource for the Intelligence Community”, STIDS Conference, 2012. http://stids.c4i.gmu.edu/papers/STIDSPapers/STIDS2012_T14_SmithEtAl_HorizontalIntegrationOfWarfighterIntel.pdf

• Smith B. et al., “Ontology for the Intelligence Analyst”, Crosstalk: The Journal of Defense Software Engineering, 2012.

• Salmen D. et al., “Integration of Intelligence Data through Semantic Enhancement”, STIDS Conference, 2011. http://stids.c4i.gmu.edu/STIDS2011/papers/STIDS2011_CR_T1_SalmenEtAl.pdf

Data Tactics Corporation

7901 Jones Branch Dr., Suite 700

McLean, VA 22102

www.data-tactics-corp.com