Information Exploitation at BBN

Raytheon BBN Technologies

From Knowledge Management to Information Exploitation

Dr. Plamen Petrov

[email protected]

March, 2015

© 2015, Raytheon BBN Technologies

Topics

• Customer-driven direction

• Semantic Web technology & tools

– Query federation - Asio™

– Efficient storage - Parliament™ and SHARD

• Applications

– Data integration, co-reference resolution, provenance

– Geospatial and temporal reasoning

– Orchestrated intel collection

• Beyond Semantic Web

– Pattern detection

– Structural semantic similarity

– Clustering, approximate optimization, other tools

BBN provides a wealth of capabilities and tools paired with deep expertise in Semantic Web and data integration.


2

Motivation & New Challenges

• Strength: Historically, we are an applied Semantic Web

group specializing in data integration & knowledge mgmt

• Goal: Apply our expertise to a wider set of information

analysis problems

– Wide variety of graph-related problems

– Structured and unstructured data

– Larger datasets

• Reasons

– Emphasis of recent projects and new funding opportunities has

expanded beyond data integration and knowledge management

– Our group’s capabilities and potential have expanded beyond

the narrow area historically associated with the Semantic Web


3

Historic and Current Capabilities

[Figure: capability chart. Capability areas: Synthesis, Prediction, Extraction, Integration, Representation & Storage, Query, Reasoning, Acquisition, Index/Search, Analysis of Alternatives, Visualization, Similarity & Clustering. Group labels: Knowledge Management, Semantic Web. Tools overlaid on the chart: Asio™, Parliament™, SILK, SSDM, Cross-Entropy, S*QL, Cytoscape, ISSL/PINT.]

• Existing tools and algorithms to address a spectrum of needs

• Focus on scalability, reusability, and semantic graph analytics

Current Technologies Overlay


4

Customer-driven Intuition

• Our customers’ needs are expanding beyond the blue boxes

• We have developed expertise to contribute to these new areas

• Working Knowledge Management and Info Exploitation together

provides opportunities to distinguish us from competitors

[Figure: capability chart covering Synthesis, Prediction, Extraction, Integration, Representation & Storage, Query, Reasoning, Acquisition, Index/Search, Analysis of Alternatives, Visualization, and Similarity & Clustering.]


5

SEMANTIC WEB TECHNOLOGIES

Technology Overview

[Figure: capability chart highlighting the Semantic Web tools Asio™, Parliament™, and SILK.]


6

Key BBN-Led Semantic Initiatives

SemWebCentral.org

asio.bbn.com

W3C OWL Recommendation

2000 2001 2002 2003 2004 2005 2006 2007 2008

FCG (AFRL/AMC)

NOTAMS (AFRL/AMC)

Horus (DARPA/IMO)

Combine/APSTARS

IEII (DARPA)

ICEWS (DARPA)

2009 2010

DAML Integration & Transition (DARPA)

JFP ACTD (JFCOM)

GARCON-F (NGA)

Geospatial SW (NGA)

Integrated Learning (DARPA)

DIESL (DARPA)

SASSI/MMON

COBRA

BBN Hosts

ISSL

PINT

Machine Reading (DARPA)

KAHT (DARPA)

2011

RC2 (DARPA)

TASM (AFRL/AMC)

Omega (ONI)

SILK (commercial)

BCBS (commercial)

SHARD

Parliament™ Open Sourced

BBN Hosts SMW Conf.

DAML DB Parliament™

2012

Asio™ Scout

SID REX

CyberOnt

Multi-INT Fusion (LM)

7


Experience Highlights

• Data Integration: Asio™, Parliament™; JFP, COBRA, SID, ISSL…
• Foundational Work: DARPA DAML, DAML DB
• Reasoning (temporal, geospatial, logic): Combine/APSTARS, Geospatial SW, GARCON-F, SID, SILK
• Ontology Development: Combine/APSTARS, SID, GARCON-F, COBRA, CyberOnt, RC2
• Open System (standards, open source): W3C, OGC, SemWebCentral, Open Ontology Repository
• NLP and Text Extraction: Machine Reading, Topic Detection, Entity/Relation Extraction

8


Active Participation in SemWeb

• World Wide Web Consortium (W3C)

• Open Geospatial Consortium (OGC)

• US Geospatial Intelligence Foundation (USGIF)

• Terra Cognita workshops on Geospatial SemWeb

• Conferences and publications

– Semantic Web Programming book

– Organized ISWC 2009

– Presence/sponsorship at major SemWeb conferences

– Continuous involvement in STIDS since its inception

• Open Source Community

– Maintain SemWebCentral.org

– Contribute to open source projects

9


http://usgif.org/

PARLIAMENT TRIPLE STORE

Technology Overview

[Figure: capability chart highlighting the Semantic Web tool Parliament™.]


10

Parliament™ Triple Store

[ Parliament: n. A group of owls. ]

Overview

• RDF triple store with SPARQL query support
– First developed under DARPA DAML (DAML DB)
– In continuous use by customers for ~8 years
• Released as an open source project in June 2009
– http://parliament.semwebcentral.com/
• Based on W3C standards for the Semantic Web
– RDF, RDFS, OWL, SPARQL

Differentiators

• Balanced query, insert, and space performance through a unique index structure
• Fast forward-chaining RDFS inference engine
• State-of-the-art optimization of complex queries
– Hundreds of triple patterns (equivalent to 30+ joins)
• Efficient temporal and geospatial queries without proprietary SPARQL extensions
• Support for efficient reification


Parliament – Layered Architecture

• Supports both Jena and Sesame

– Jena is highly customized to take advantage of indexing for Complex Queries

• Storage and Rule engine implemented in C++ for performance

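[Figure: layered architecture. Java layer: Jetty + Joseki, Jena with the Jena Graph for Parliament, Sesame with the Sesame SAIL for Parliament, and the Jena Graph for Indexes (named graph support, temporal index, geospatial index); query languages SPARQL and SERQL. C++ layer, reached through the Java Native Interface: Parliament and its rule engine, on top of Berkeley DB and the operating system. The legend distinguishes third-party components from Parliament components.]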

12


Parliament Statement Table

Each entry (statement) contains:

• Three resource ID fields: Subject, predicate,

and object of the statement

• Three statement ID fields: Next statements

using the same resource as subject, predicate,

and object

• Bit-field flags encoding statement attributes

• Recently: a statement id field to support

reification

13


Parliament Resource Dictionary

Each entry (resource) contains:

• Bidirectional string-to-ID mapping

• Three statement ID fields: First statements

using this resource as subject, predicate, and

object

• Three count fields: Numbers of statements

using this resource as subject, predicate, and

object

• Bit-field flags encoding resource attributes

14


Parliament’s Index Structure

• Dynamic applications often require efficient statement insertion (as opposed to bulk loading)

• Goal: Balanced insertion, query performance, and storage space required

• Parliament stores triples using two components:
– Resource dictionary
– Statement table

• Parliament optimizes queries using:
– Subject, predicate, and object indexes and size counts

– These are maintained virtually for free
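The statement table and resource dictionary described on the preceding slides can be sketched as follows. This is a minimal in-memory illustration, not Parliament's C++ implementation: the names `TinyStore`, `Resource`, and `Statement` are hypothetical, flags and reification fields are omitted, and the rule of walking the shortest chain is an assumption about how the size counts might be used.

```python
from dataclasses import dataclass, field

NULL = -1  # sentinel statement ID meaning "end of chain"


@dataclass
class Resource:
    text: str
    # First statement using this resource as subject, predicate, object.
    first: list = field(default_factory=lambda: [NULL, NULL, NULL])
    # Number of statements using this resource as subject, predicate, object.
    count: list = field(default_factory=lambda: [0, 0, 0])


@dataclass
class Statement:
    ids: tuple   # (subject ID, predicate ID, object ID)
    next: list   # next statement sharing the same subject, predicate, object


class TinyStore:
    """Hypothetical toy store; illustrates the linked-chain idea only."""

    def __init__(self):
        self.resources = []   # resource ID -> Resource
        self.ids = {}         # string -> resource ID
        self.statements = []  # statement ID -> Statement

    def _rid(self, text):
        if text not in self.ids:
            self.ids[text] = len(self.resources)
            self.resources.append(Resource(text))
        return self.ids[text]

    def add(self, s, p, o):
        ids = tuple(self._rid(t) for t in (s, p, o))
        sid = len(self.statements)
        nxt = []
        for pos, rid in enumerate(ids):
            res = self.resources[rid]
            nxt.append(res.first[pos])  # old chain head becomes our successor
            res.first[pos] = sid        # new statement becomes the chain head
            res.count[pos] += 1         # size count is updated as a side effect
        self.statements.append(Statement(ids, nxt))
        return sid

    def match(self, s=None, p=None, o=None):
        pattern = (s, p, o)
        if any(t is not None and t not in self.ids for t in pattern):
            return []  # an unknown resource cannot match anything
        want = tuple(None if t is None else self.ids[t] for t in pattern)
        bound = [i for i in range(3) if want[i] is not None]
        if bound:
            # Assumption: walk the chain with the smallest size count.
            pos = min(bound, key=lambda i: self.resources[want[i]].count[i])
            sid = self.resources[want[pos]].first[pos]
            candidates = []
            while sid != NULL:
                candidates.append(sid)
                sid = self.statements[sid].next[pos]
        else:
            candidates = range(len(self.statements))
        out = []
        for sid in candidates:
            ids = self.statements[sid].ids
            if all(w is None or w == i for w, i in zip(want, ids)):
                out.append(tuple(self.resources[i].text for i in ids))
        return out


store = TinyStore()
store.add("alice", "knows", "bob")
store.add("bob", "knows", "carol")
store.add("alice", "type", "Person")
print(sorted(store.match(s="alice")))
```

In this sketch an insert touches only the three chain heads and three counters, which is one way to read the claim that the indexes come "virtually for free".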

15


Parliament Performance

• Parliament maintains excellent query performance for complex queries* while significantly increasing throughput and decreasing space requirements

*Queries equivalent to 30+ SQL joins

• Current and future work includes:

– Parallelization to a cloud architecture: Hadoop/Accumulo

– Query optimization strategies in a distributed architecture

– Analysis of Parliament’s internal rule engine

– Further optimizations to the native storage structure

16


SHARD: Triple-Store Built on

Prioritized Goals

• Commodity hardware ONLY

• Highly scalable

• Decentralized computing

• Robust to node failures

Design Considerations For

• SPARQL

• Complex queries

• Large query responses

=> Distributed Query Optimization

Functional Overview of SHARD proof-of-concept

• Method calls at client

• Clause-iteration via Map-Reduce jobs

• Iterate over query clauses to find partial query matches

• Join partial query responses with flagged keys

• Move results to local machine for local drill-down

• Hadoop abstraction layer manages partial system failures with storage and computation redundancy
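The clause-iteration idea above can be illustrated with a small single-process sketch. It is not SHARD code and does not use Hadoop: the "map" and "reduce" phases are plain Python loops, and the function names `match_clause` and `run_query` are hypothetical. Each pass matches one query clause against all triples, then joins the new bindings with the partial matches found so far on their shared variables.

```python
def match_clause(clause, triple):
    """Return variable bindings if the triple matches the clause, else None."""
    binding = {}
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            # A variable repeated inside one clause must bind consistently.
            if binding.get(term, value) != value:
                return None
            binding[term] = value
        elif term != value:
            return None
    return binding


def run_query(clauses, triples):
    """Iterate over clauses, joining partial matches on shared variables."""
    partial = [{}]  # one empty binding: nothing constrained yet
    for clause in clauses:
        # "Map" phase: find every triple that matches this clause.
        matches = [b for t in triples
                   if (b := match_clause(clause, t)) is not None]
        # "Reduce" phase: keep combinations that agree on shared variables.
        joined = []
        for left in partial:
            for right in matches:
                if all(left.get(k, v) == v for k, v in right.items()):
                    joined.append({**left, **right})
        partial = joined
    return partial


# Illustrative data in the spirit of LUBM; the names are made up.
triples = [
    ("alice", "advisor", "bob"),
    ("bob", "teaches", "cs101"),
    ("alice", "takes", "cs101"),
    ("carol", "advisor", "bob"),
    ("carol", "takes", "cs102"),
]
# A triangular query: a student taking a course taught by their advisor.
clauses = [("?s", "advisor", "?p"), ("?p", "teaches", "?c"), ("?s", "takes", "?c")]
print(run_query(clauses, triples))
```

In a distributed setting each pass would be a separate job over partitioned data, which is why the number of clauses, rather than the number of joins inside one process, drives the cost.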


SHARD v1 Benchmarking

Query Type                                  SHARD v01              Parliament + Sesame*   Parliament + Jena*
Simple Query, Small Result Set (Query 1)    404 sec. (~0.1 hr.)    0.1 hr                 0.001 hr
Triangular Query (Query 9)                  740 sec. (~0.2 hr.)    1 hr                   1 hr
Simple Query, Large Result Set (Query 14)   118 sec. (~0.03 hr.)   1 hr                   5 hr

*K. Rohloff, M. Dean, I. Emmons, D. Ryder, J. Sumner. “An Evaluation of Triple-Store Technologies for

Large Data Stores.” 3rd International Workshop On Scalable Semantic Web Knowledge Base

Systems (SSWS '07), 2007.

• Deployed code on Amazon EC2 cloud.

– 19 XL nodes.

• LUBM (Lehigh Univ. BenchMark)

– Artificial data on students, professors, courses, etc… at universities

• 800-million-edge graph from the 6,000-university LUBM dataset

• In general, SHARD compared favorably with “industrial” monolithic triple-stores

18


Key Design Issues, From Experience

• Data partitioning is key

– Standard hash partitioning limits scalable performance

because it makes decentralized processing almost

impossible

– Decentralized processing more efficient if related data

is colocated

– There are provenance and fault tolerance issues

• Indexing vs. full data scans

– Lack of indexing enables more efficient data ingest and

scans, but severely limits look-ups

– Need to be sensitive to applications

– Possible win through partial distributed indexing

19


Future Triple Store Requirements

• Customer needs

– Highly scalable and cost-effective graph data storage and

efficient inferencing/querying

– Operate over standard infrastructure such as Hadoop/HDFS,

interface with Accumulo/CloudBase

– Use through standards-based APIs: SPARQL, REST, SOAP

– Comply with Jena, Sesame, or BlackBook

– Scalable to very large graphs, yet agile for high-throughput

• Goals for next generation of Parliament and SHARD

– Redesign Parliament’s optimizations for a cloud-based

architecture to support customer mission objectives

– Support large graph data ingests, inferencing, and efficient

querying on commodity Hadoop/Accumulo deployment

20


SEMANTIC QUERY FEDERATION

Technology Overview

[Figure: capability chart highlighting the Semantic Web tool Asio™.]


21

Overview

A set of runtime software and development tools designed to address the requirement for robust data federation across disparate sources. Based on World Wide Web Consortium (W3C) industry standards, the Asio™ tool suite allows systems and users to reach a shared understanding about the meaning, structure, and context of the data exchanged.

Differentiators

• A very expressive mapping language gives freedom in connecting data sources to domain ontology

• Advanced execution engine delivers query results quickly

• Workbench components significantly reduce configuration time and effort

• Asio™ has been in active development since 2006; v.1 in 2007

http://asio.bbn.com/

Asio™ Tool Suite

22


Architectural View

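[Figure: Asio™ architecture. A SPARQL query posed against the domain ontology (OWL) enters Semantic Query Decomposition (SQD), which works with mapping ontologies (OWL, SWRL rules) and sends SPARQL sub-queries to the Semantic Bridges. The Semantic Bridge for relational databases issues SQL to an RDBMS, the Semantic Bridge for web services issues REST/SOAP calls to a web service described by WSDL, and the Semantic Bridge for SPARQL endpoints issues SPARQL to an RDF triple store. Each source has a data source ontology (OWL). Asio™ Workbench tools shown: Snoggle and Automapper.]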

23

Asio™ Federated Query

[Figure: a SPARQL query against the domain ontology (OWL) passes through Semantic Query Decomposition (SQD) and the Semantic Bridges (two for relational databases, one for a SPARQL endpoint, one for a web service) to RDBMS One, RDBMS Two, a SPARQL endpoint, and a web service. The six numbered callouts carry the labels SPARQL Query, Query Decomposition, Generation of Sub Queries, Data Access, Backwards Rule Chaining, and Query Result Set.]

24


Asio™ Strengths

• Very expressive mapping language (SWRL with functional extensions)
– Allows significantly more mapping power than OWL mappings alone (equivalentClass, subClass, equivalentProperty, etc.)
– Truly independent domain ontology
– Required for real data unification

• Efficient streaming result sets
– UIs need answers quickly
– Lower memory requirements mean more concurrent users

• Advanced execution engine
– Ontology reasoning for reduced intermediate data set sizes
– SWRL builtin to SQL function converter pushes filtering to low levels
– Configurable batch size to handle higher latency situations

25


Why Mapping Language Matters

• Real-world denormalized database schemas do not match well with ontologies:

ID   Name            Dept Name   Dept Location
1    John Smith      Finance     B1
2    Jane Doe        Contracts   B1
3    Steve Stevens   IT          B2
4    Tom Thomas      IT          B2

[Figure: target ontology with Employee, Department, and Building nodes; the Building is identified by a string value.]

This mapping requires creating a new entity, Building, from the Person table row.

26



• Sometimes a good schema can be stretched too far:

ID   Name            Manager
1    John Smith      Stephanie Fox
2    Jane Doe        John Smith; Tom Thomas
3    Steve Stevens   John Smith; Frank Foster

[Figure: Jane Doe linked to two separate manager entities, John Smith and Tom Thomas.]

This mapping requires splitting the “manager” column into multiple parts by invoking procedural extensions.
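The kind of procedural step this mapping needs can be sketched as follows. This is not Asio's SWRL mapping syntax; it is a plain Python illustration with hypothetical names (`rows_to_triples`, the `emp:` prefix, the `hasName` and `hasManager` properties) showing how one multi-valued column becomes several statements.

```python
def rows_to_triples(rows):
    """Turn table rows into (subject, property, value) triples.

    The "Manager" column may hold several names separated by semicolons;
    each name becomes its own hasManager statement.
    """
    triples = []
    for row in rows:
        employee = f"emp:{row['ID']}"  # hypothetical identifier scheme
        triples.append((employee, "hasName", row["Name"]))
        for manager in row["Manager"].split(";"):
            manager = manager.strip()
            if manager:  # skip empty fragments such as a trailing ";"
                triples.append((employee, "hasManager", manager))
    return triples


rows = [
    {"ID": 1, "Name": "John Smith", "Manager": "Stephanie Fox"},
    {"ID": 2, "Name": "Jane Doe", "Manager": "John Smith; Tom Thomas"},
    {"ID": 3, "Name": "Steve Stevens", "Manager": "John Smith; Frank Foster"},
]
for triple in rows_to_triples(rows):
    print(triple)
```

A mapping language restricted to class and property equivalences cannot express the split, which is the point the slide makes about needing procedural extensions.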

27



• Even when the schema is reasonable, it may not be the way the domain should be modeled:

ID   Name            Title
1    John Smith      Vice President
2    Jane Doe        Director
3    Steve Stevens   Director

[Figure: class hierarchy linked by subclassOf, with the classes Employee, Manager, Director, and Vice President.]

The type of the entity in the row must be mapped from a string value in a column.

28


Asio™ “Streaming” Result Sets

• Forming federated query results can be complicated and can require many intermediate queries

• User interfaces still need to start showing progress quickly, even if there are hundreds or thousands of results

Solution: filter and process individual results on the fly, and start returning them immediately

* Some queries, like those that involve DISTINCT, COUNT, or ORDER

BY, will still need to be fully processed before returning an answer
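The difference between streaming and fully processed results can be shown with Python generators. This is a sketch of the general idea only, not Asio's execution engine; `slow_source`, `stream_results`, and `ordered_results` are hypothetical names.

```python
def slow_source(name, rows, log):
    """Simulate a remote data source; log records each time it is touched."""
    for row in rows:
        log.append(name)
        yield row


def stream_results(sources, keep):
    """Yield each qualifying row as soon as it is available."""
    for source in sources:
        for row in source:
            if keep(row):
                yield row


def ordered_results(sources, keep):
    """ORDER BY style processing: nothing is returned until all rows are in."""
    return sorted(stream_results(sources, keep))


log = []
results = stream_results(
    [slow_source("a", [3, 1], log), slow_source("b", [2], log)],
    keep=lambda row: True,
)
first = next(results)  # the first answer arrives after touching only source "a"
print(first, log)
```

The caller can render `first` while the remaining sources are still being read, whereas `ordered_results` must drain every source before it can return anything.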

29



Streaming result sets start feeding query results to the user interface immediately.

[Animation across four slides: result rows appear in the user interface while the federated query is still running.]

Asio™ Advanced Execution Engine

• Query rewriting creates efficient subqueries
– Ontology reasoning
  Disjoint classes
  • Liberal use of disjointness statements in the ontology helps to reduce generated UNIONs in certain domain ontology situations
  • Pairwise disjointness can be asserted automatically for some data source ontologies
  Functional / inverse functional properties
  • Many unbound variables introduced in the rule expansion stage are unified
– SWRL builtin <-> SPARQL FILTER <-> SQL WHERE equivalence

• Configurable for various data source distributions
– Streaming block size can be changed for different latency situations

34


Asio™ - Other Features

• Subclass/subproperty reasoning
– This creates more possibilities for inferring query statements (in the way that you would expect)

• No need to do mappings on a per-query basis
– Mappings are written from data source to domain
– If a data source is removed, mappings written for other sources remain unchanged

• Multiple user perspectives and access control
– Distinct sets of mapping rules are defined on a per-user or per-group basis
– This allows different users or groups to have different perspectives on the same data
– Provides access control through available mappings

35


Asio™ Workbench

• Workbench significantly reduces time and effort for Asio™ configuration
– Works within the Eclipse development environment
– Second-generation implementation with wizards and XML views
– Being deployed with Asio™ for Raytheon-internal data management

[Screenshot callouts: integrated management of data sources; source code view allows editing of XML; specialized ASIO perspective; configuration file access; DB Connection Wizard; test connection immediately.]


36

Asio™ Performance

• Preprocessing of a query takes milliseconds

• “Streaming” results means that you start getting query

answers back quickly, even if there are many results

– This improved our performance by two orders of magnitude

• The processing has very little overhead over just executing

the queries

• Many stateless instances can run on the same machine

37


Needs Beyond a Stateless Architecture

• Caching results

– The source data may not be always reachable

– Caching should improve performance

• Storing metadata

– For co-reference resolution

– For provenance

– Other “persisting” annotations

• Medium- to High- performance Triple Stores

needed

– Depending on implementation strategy

– Query performance optimization may be more

valuable than sheer volume of data stored

38


DATA INTEGRATION,

CO-REFERENCE RESOLUTION

Technology Applications


39

Data Integration Examples: SID & ISSL

• Semantically Integrated Databases (SID)

– Challenge: data for risk analysis scattered among five different

RDBs

– Solution:

• Semantic alignment of data using Asio™

• Non-destructive co-reference resolution of instance data

• Federated data tagged with provenance & time-interval of validity

• Integrated Semantic Search Layer (ISSL)

– Challenge: many disparate RDBs contain the data necessary to

correlate and identify activities of interest

– Solution:

• Semantic alignment of data using Asio™ to a domain ontology

• Integration between RDBMS and RDF data sources

• Streaming federated queries across all databases, while

maintaining provenance information

40


Data Integration Examples - COBRA

• Collaborative Ontology-Based Reasoning Architecture

– Create a unified common data environment

– Integrate disparate data sources; use “analysis case” as a container

– Identify and resolve co-references using rules, recommendations & human experts

– Provide flexible data mapping to a domain ontology & a shared knowledge base

– Data is queried and used in multiple analytical views via application plugins

– Ontology drives the User Interface

[Figure: COBRA data flow.
Sources: flat files, RDBMS, INT reports, web services, live feeds.
Data Integration: convert to OWL/RDF; map to common domain ontology; track provenance; rapid integration of analysis teams’ ad hoc files. Result: harmonized and integrated data expressed in the end users’ vocabulary.
Co-Reference Resolution: identify duplicates; choose versions of attribute values; non-destructive merge and unmerge. Result: clean data, ready for analysis.
Common Data Environment: knowledge base, plug-ins, COTS analysis tools.]


Data Ingest & Deconfliction UIs

42

• Although we focus mainly on back-end components,

we have developed, deployed, and integrated with

many sophisticated user interfaces

– Emphasis on usability, working closely with customers

– Experience with multiple UI toolkits, thick & thin client


Provenance

• Types of provenance
– What data source did the data come from?
– Who modified/processed it? And how?
– What rule created it?
– Where (in the text) was it extracted from?
– …

• Common approaches
– Reification (efficient implementation in Parliament)
– Named graphs

• Experience in a number of projects
– COBRA: data source tracking, user belief in a fact
– SID: time value for which a statement was known to be true
– Machine Reading: textual provenance, rule provenance

[Figure: the statement (Person1, hasName, “E.J. Blatt”) annotated with source = HR Records and confidence = High.]


The Co-reference Resolution Problem

• Data from multiple sources is bound to have

– Multiple data elements/values referring to the same physical entity

– This may be caused by

• Inconsistent or incomplete data (e.g. from manual entry errors)

• Bad schemas, or misuse of the schemas

• Inconsistent schema or ontology alignment across data sources

• Inconsistent value conventions across data sources

– Querying across such sources becomes difficult

• The problem is common: COBRA, SID, ISSL/PINT…

• The solution requires Co-reference Resolution

– Identify which data elements refer to the same physical item

– link (or un-link) data records that may (or may not) refer to the same

item

– These processes should be reversible, in the face of new evidence

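[Figure: two Person nodes, one with hasName “E.J. Blatt” and one with hasName “Eric Blatt”, joined by a question mark: do they refer to the same person?]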

44


Co-Reference Resolution

• Approaches to merging entities

– Ontological modeling and mapping (along with data value conventions)

– Rules (with procedural extensions) for integrity constraints and matching

– Algorithmic “Distance” Metrics

– Structural similarity

– Learning thresholds for classes of similar objects

• A Framework for Co-Reference Resolution

– Similarity functions: textual, conceptual, spatial, etc.

– Run-time evaluation

– Automated or semi-automated merge decisions

– Non-destructive merge/split

• Examples: COBRA (complete framework), PINT (similarity functions)

– Analysts liked the option for multiple similarity scores
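Two of the framework elements above, a textual similarity function and a non-destructive merge/split, can be sketched as follows. This is an illustration under stated assumptions, not the COBRA or PINT implementation: the similarity function is the standard library's `difflib.SequenceMatcher` ratio, and `CorefLinks` is a hypothetical class that records merge decisions as separate links so the source records are never altered.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Textual similarity in [0, 1]; one of several possible similarity scores."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class CorefLinks:
    """Store merge decisions as links; records themselves stay untouched."""

    def __init__(self):
        self.links = set()  # each link is a frozenset of two record IDs

    def merge(self, a, b):
        self.links.add(frozenset((a, b)))

    def unmerge(self, a, b):
        # Reversible: removing the link restores the previous state.
        self.links.discard(frozenset((a, b)))

    def cluster(self, start):
        """Return every record currently considered the same entity as start."""
        seen, todo = {start}, [start]
        while todo:
            current = todo.pop()
            for link in self.links:
                if current in link:
                    for other in link - {current}:
                        if other not in seen:
                            seen.add(other)
                            todo.append(other)
        return seen


links = CorefLinks()
if name_similarity("E.J. Blatt", "Eric Blatt") > 0.6:  # threshold is an assumption
    links.merge("hr:person1", "badge:person7")
print(links.cluster("hr:person1"))
links.unmerge("hr:person1", "badge:person7")  # new evidence: split them again
print(links.cluster("hr:person1"))
```

Keeping the links outside the records is what makes the merge reversible; a real system would also attach provenance and a score to each link.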

[Figure: two Person nodes, each with an SSN and a name (“E.J. Blatt” and “Eric Blatt”), joined by a “similar?” link.]

45


Geospatial and Temporal Reasoning

• Applicable to a wide variety of problem domains:

Logistics, HR, Readiness, C2, Intel, etc.

• Temporal Reasoning

– Show me all employees who had access to Lab 47B

between January and June?

– At which locations is dwell time greater than 2 years?

– How long on the average are O4s and below deployed

in Afghanistan?

• Geospatial Reasoning

– Show me all officers stationed within 200km of Kabul

– What is the average pay of GIs within combat zone B3
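The two kinds of question above reduce to simple primitives: interval overlap for the temporal case and great-circle distance for the geospatial case. The sketch below shows both in plain Python. It does not use Parliament's temporal or geospatial indexes; the records, names, and coordinates are made up for illustration, and the Kabul coordinates are approximate.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def within_km(records, center, limit_km):
    """Names of records located within limit_km of center (lat, lon)."""
    return [r["name"] for r in records
            if haversine_km(r["lat"], r["lon"], *center) <= limit_km]


def had_access_between(grants, start, end):
    """People whose access interval overlaps [start, end]."""
    return sorted({g["who"] for g in grants
                   if g["start"] <= end and start <= g["end"]})


KABUL = (34.53, 69.17)  # approximate coordinates, for illustration only
officers = [  # hypothetical records
    {"name": "officer-1", "lat": 34.95, "lon": 69.26},  # a few tens of km away
    {"name": "officer-2", "lat": 31.61, "lon": 65.71},  # several hundred km away
]
grants = [  # hypothetical access grants for one lab
    {"who": "employee-1", "start": date(2014, 3, 1), "end": date(2014, 4, 15)},
    {"who": "employee-2", "start": date(2014, 8, 1), "end": date(2014, 9, 1)},
]
print(within_km(officers, KABUL, 200))
print(had_access_between(grants, date(2014, 1, 1), date(2014, 6, 30)))
```

A linear scan like this is fine for a handful of records; the value of dedicated temporal and geospatial indexes is avoiding that scan on large stores.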

46


Geospatial and Temporal Reasoning

Geospatial Analysis & Reporting Conceptual Framework (GARCON-F)

– Challenge: NGA analysts are limited by exhaustive keyword search

through text reports or “flat” tags

– Solution:

• Conceptual Search via mapped ontologies for richer information

exploitation

• Provide Ontological, Geospatial, and Temporal reasoning primitives

• Enable reasoning dependent on context (e.g. “show threats to route”)

– Example query from GARCON: Highlight contextual reasoning

– Used Parliament’s geospatial & temporal indexes for efficiency

Semantically Integrated Databases (SID)

– Provided a unique “snapshot-in-time view” of the Federated Query

result set known to be valid at a particular point in time

– Used Parliament’s temporal index and reasoning primitives

47


Intel Collection Orchestration Loop

48

[Figure: Orchestrated Intel Collection loop. Labels in the figure: Information Needs Planning; Object Identification, Scoring & Prioritization; Post-Processing; Information Need and Collection Strategy Management; Information Assessment; Data Processing & Fusion; Distributed & Autonomous Collection Operations; Problem Statement; Problem Decomposition; Collection Strategy Planning; Observation Methods Identification, Scoring & Prioritization; Task Allocation; Task Valuation per Collection Method per Opportunity; Selection; Analyze Existing Intel; Collection Assessment.]


[Figure: proposed system architecture. Stages: Problem Statement; Problem Decomposition; Observation Needs and Collection Strategy Derivation; Task Prioritization using Intel Value Calculus; Scheduling & Collection; Mission Execution; Fusion & Feedback; Mission Assessment. Data items: Intelligence Problem Statement; Prioritized Information Need; Prioritized Observation Need; Prioritized Observation Method; Observation Opportunity; Observation Probability, Benefit & Cost; Processing/Fusion Quality Probability; Intel Product & SA. Roles: Analyst, Customer. Annotations: Prospective Sensemaking; Retrospective Sensemaking; Analysis of Alternatives; Cognitive Dynamic Adaptation; Orchestrated Collection.]


Proposed System Architecture


INFORMATION EXPLOITATION

Technology Overview

[Figure: capability chart covering Synthesis, Prediction, Extraction, Integration, Representation & Storage, Query, Reasoning, Acquisition, Index/Search, Analysis of Alternatives, Visualization, and Similarity & Clustering.]


51

SILK: SEMANTIC REASONING

Technology Overview

[Figure: capability chart highlighting the Semantic Web tool SILK.]


52

SILK Semantic Inferencing on Large Knowledge

Challenge: Expressiveness sufficient to efficiently represent and reason over complex processes and policies, including defaults and conflict resolution

Solution: Layered architecture combining standards (RIF) and research results from the Semantic Web and logic programming communities

Results: Rules language, UI, and inference engine used to model complex policy & process rules & support other applications

Customer: Vulcan Inc., Project Halo

http://silk.semwebcentral.org/

SILK Makes the Enterprise Model Simpler

SILK offers expressive power not found in OWL or RIF

• Powerful conflict detection and resolution
– “Policy A, requiring a purchase order to pay bills, conflicts with policy B, which requires payment within 15 days.”

• Expressive reasoning about time, processes, and change
– “Beginning in July, consulting rates will increase 5%. All other pricing data remains unchanged.”

• Hierarchies of exception cases
– “Red Division uses the standard payroll rules except they use their own rules for check cutting.”

[Figure annotations: Taction2 = Taction1 + 5; ≠]

54


ISSL/PINT

Project Overview

[Figure: capability chart highlighting ISSL/PINT.]


55

The Problem Space

• Goal

– Detect and disrupt the terrorist networks manufacturing and

emplacing IEDs

• Requirements

– Understand and encode Red Nodal Processes for doing bad things

– Search through incoming intelligence artifacts for Observations that

can be used as evidence of a Red Nodal Process in action

• Terminology
– Red Nodal Process: a documented process by which an enemy organization achieves some goal (e.g. manufacture of a car bomb)
– Red Nodal Reference Model: a model containing all documented Red Nodal Processes and their relationships
– Observable: something visible to the human eye or to sensors (e.g. Blue Barrel)
– Observation: an Observable instance, seen at a specific time/place


56

PINT Challenge – The Answer


57

PINT Challenge – The Reality


58

What Makes it Hard

• Red Nodal Processes can change

– Process structure changes

– Observables change

• Activities are not observed, only their indicators

– One piece of evidence can indicate multiple potential Activities

• Most activities within a Process have benign corollaries

– There are no smoking guns to find

– You must find multiple indicators of a Process that occur in the

appropriate order and within an appropriate temporal and spatial

region
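The last requirement, several indicators seen in the right order within a bounded time span, can be illustrated with a small scoring function. This is not the PINT algorithm; it is a sketch that treats a process path as an ordered list of activities, each with a set of indicators, and scores how many activities can be matched in order by observations inside a time window. The function names, the indicator strings, and the scoring rule are all assumptions.

```python
def _longest_ordered_match(activities, indicators):
    """Longest in-order matching of activities to indicators (LCS-style DP)."""
    prev = [0] * (len(indicators) + 1)
    for _, allowed in activities:
        cur = [0]
        for j, indicator in enumerate(indicators, 1):
            if indicator in allowed:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ordered_match_score(activities, observations, window):
    """Fraction of activities matched, in order, within some time window.

    activities:   list of (name, set of indicators) in process order
    observations: list of (time, indicator)
    window:       maximum time span covered by the matched observations
    """
    obs = sorted(observations)
    best = 0
    for i, (t0, _) in enumerate(obs):  # try each observation as window start
        inside = [ind for t, ind in obs[i:] if t - t0 <= window]
        best = max(best, _longest_ordered_match(activities, inside))
    return best / len(activities)


# Hypothetical path and indicators, loosely following the sandwich example.
path = [
    ("Get Recipe", {"Cookbook", "Calling Mom"}),
    ("Buy Eggs", {"egg receipt"}),
    ("Make Sandwich", {"kitchen smoke"}),
]
in_order = [(1, "Cookbook"), (2, "egg receipt"), (3, "kitchen smoke")]
reversed_order = [(1, "kitchen smoke"), (2, "egg receipt"), (3, "Cookbook")]
print(ordered_match_score(path, in_order, window=10))
print(ordered_match_score(path, reversed_order, window=10))
```

The same evidence in the wrong order scores lower, which reflects the point that no single observation is decisive and that order and proximity carry the signal.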


59

Example Process

[Figure: process graph with the activities Get Recipe, Buy Eggs, Buy Cheese, Buy Bagel, Buy Muffin, Make Sandwich, and Eat Sandwich.]

• 7 Activities
• 1 Decision
• 1 Fork

Indicators
• Cookbook
• Watching Food Network
• Calling Mom


60

Example Process - More Realistic

• 49 Activities •14 Decisions • 3 Forks


61

Identified Path

PINT finds the best fit path within a process


62

ISSL/PINT

[Figure: ISSL/PINT overview. Elements shown: HUMINT, GEOINT, and SIGINT sources; Asio™; ISSL; Reference Model; SME; Case File; Analyst; PINT Process Detection; Detected Process.]


63

How it works – Process Multiplexer

[Figure: processes and observations paired by the Process Multiplexer.]


64

How it works – Process Finder

[Figure: bindings are correlated using heuristic affinity values, producing correlated bindings.]


65

How it works – Process Finder

[Figure: a cluster of bindings is mapped to a process, giving the detected solution over nodes A, B, C, and D with score = 0.91.]


66

PINT Results

• Results

– Finds answers in a few seconds

• Competitor’s system requires overnight processing to search

for a single Process

– Matches on a single path in the process

– Automatically adapts to changes in the model

• Next Steps for ISSL/PINT

– Begin automating the link between ISSL and PINT

• E.g. Reduce analyst workload by (semi-)automatically

detecting Observables in unstructured intelligence (reports,

signatures, imagery, etc)

• E.g. Learn correlation of terms/indicators in text reports


67

SSDM, Clustering,

Cross-Entropy, Other Tools

Tech Overview

[Figure: capability chart highlighting SSDM, Cross-Entropy, S*QL, Cytoscape, and PINT Clustering.]


68

SSDM - What is it?

• Structural Semantic Distance Metric (SSDM)

– What does it do?

• Looks at structural components of labeled graphs to

determine similarity between nodes.

• This is related to Google’s “similar pages” feature except that

it attempts to account for the semantics of edges

• Intuition: Nodes are similar if related to (other) similar nodes

– Why is it useful?

• Often, customer data naturally represents (or can be

massaged into) a labeled graph representation.

• While there exist mechanisms to allow for query against such

graphs, they are brittle and require intimate knowledge of the

composition of and vocabulary associated with the graph.


69

Structural Similarity - Examples

70


[Figure: directed graph over the nodes s, t, u, v, and x.]

• sim(x,x) = 1
• sim(s,t) > 0
• sim(u,v) > 0
• sim(u,v) < sim(s,t)
• sim(u,t) = 0

In a directed graph, nodes are similar if connected to other similar nodes
• SimRank algorithm
• pSimRank fast approximation
• Similarity computed by converging random walks

[Figure: two labeled graphs. The first has edges labeled “stars in”, “writes”, “has genre”, and “has genre”; the second has edges labeled “stars in”, “stars in”, “has genre”, and “has genre”.]

• sim(u,v) > sim(w,y)

SSDM extends pSimRank with the semantics of labeled edges
• Random walks are only allowed to meet if they traverse the same sequence of labels

• “Obscure” or “obvious”

edges can be favored
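The effect of respecting edge labels can be seen in a naive, exact SimRank iteration with one change: two in-neighbors are only paired when they are reached over edges with the same label. This is a sketch of the idea, not SSDM itself and not the pSimRank approximation. The function name, the decay value, and the choice to normalize by all in-edges are assumptions, and the pairwise iteration is only practical for tiny graphs.

```python
from collections import defaultdict
from itertools import product


def labeled_simrank(edges, decay=0.8, iterations=5):
    """Naive SimRank where in-neighbors pair up only across equal edge labels.

    edges: iterable of (source, label, target) triples
    Returns a dict mapping (a, b) to a similarity in [0, 1].
    """
    incoming = defaultdict(list)  # node -> [(label, source), ...]
    nodes = set()
    for src, label, dst in edges:
        incoming[dst].append((label, src))
        nodes.update((src, dst))

    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[a, b] = 1.0
                continue
            ia, ib = incoming[a], incoming[b]
            if not ia or not ib:
                new[a, b] = 0.0
                continue
            # Only pairs of in-edges carrying the same label contribute.
            total = sum(sim[x, y]
                        for (la, x), (lb, y) in product(ia, ib) if la == lb)
            new[a, b] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim


# Hypothetical movie-style graph; node and label names are illustrative.
edges = [
    ("lucas", "directed", "star_wars"),
    ("lucas", "directed", "thx_1138"),
    ("lucas", "wrote", "willow"),
]
sim = labeled_simrank(edges)
print(sim["star_wars", "thx_1138"], sim["star_wars", "willow"])
```

Plain SimRank would score both pairs the same because all three films share the in-neighbor `lucas`; the label check separates "directed by the same person" from "directed by one, written by the other".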

Semantic Similarity – a Test Case

• The Internet Movie Database contains a wealth

of information about films, actors, directors,

producers, etc. and their interactions

between/among one another.

• As RDF: on the order of 100K nodes and 1,000K edges

• Say we are interested in obtaining a list of

movies similar to Star Wars…


71

Querying for “Star Wars”

“Star Wars Gangsta Rap”

Really?!


72

Nodes Similar to <Star Wars>

Using SSDM offers more intuitively-correct results: e.g. other movies

directed by George Lucas, or starring Mark Hamill (Luke Skywalker)


73

Beyond Just Similarity

• Minor tweaks allow for similarity explanations

• Can tune similarity scores to favor:

– Obvious explanations (e.g. Director=George Lucas)

– Obscure explanations (e.g. Editor=T.M. Christopher)

• Very scalable and highly parallelizable algorithm

– Similar to Google’s PageRank

• There are still a number of optimizations we can

conceive of that would make the system even

faster



Clustering

• What does it do?

– Determines groupings of ‘nearby’ items in large sets of data under some salient measure of proximity

• Why is it useful?

– Summarization

• E.g., semi-supervised learning of ontologies

– Outlier detection

• Useful for anomaly detection systems

– Linkages between groups

• Links between clusters are often interesting; for instance, they can show how information is shared among otherwise disparate groups.



Clustering - Applications

[Figure: two panels. "Interesting Connections and Outliers" (annotated with "Relationships between groups" and "Outlier") and "Semi-Supervised Taxonomy Learning"]



Clustering – How it works

• Uses state-of-the-art spectral clustering for high-dimensional data (working prototype).

• Tested on 197,000-dimensional data sets (Movie Database)

• Modular implementation:

– A Java-based k-means clusterer; can plug in others

– A Java-based information-theoretic heuristic for discovering the number of clusters

– Clustering can use SSDM as its distance metric

– …or other similarity metrics
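
The pluggable-clusterer idea can be illustrated with a minimal k-means sketch that takes its distance function as a parameter. This is written in Python for brevity and is not the Java prototype; the function names, the toy points, and the fixed `k` are assumptions, and the spectral embedding and the heuristic for choosing the number of clusters are not shown. Because the update step averages coordinates, the points must be vectors; in a spectral pipeline k-means typically runs on an eigenvector embedding of a similarity matrix rather than on raw similarity scores.

```python
import random


def euclidean(p, q):
    """Straight-line distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5


def manhattan(p, q):
    """An alternative metric, to show that the distance is swappable."""
    return sum(abs(x - y) for x, y in zip(p, q))


def kmeans(points, k, distance=euclidean, iterations=20, seed=0):
    """Plain k-means with a caller-supplied distance function."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct starting centers
    assignment = None
    for _ in range(iterations):
        # Assign each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: distance(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers


# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centers = kmeans(points, k=2)
assignment_l1, _ = kmeans(points, k=2, distance=manhattan)
```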



Cross-Entropy

• What does it do?

– Theoretically sound approach to tackling nasty

optimization problems … approximately

• Why is it useful?

– General purpose

• Applicable to a large set of graph-related problems (and other

problems)

– Only a few parameters to configure

– Fast, and iterative – you can stop at any time and get

an approximate solution



Cross-Entropy Application –

Quadratic Assignment

• NP-Hard optimization problem

• Real-world usage in hospital design/layout

– Ontology alignment with multiple competing similarity

metrics is another possible formulation (many, many

more)

• QA problem proposed in 1968 finally solved in

2000 with a worldwide network of hundreds of

computers after 7 days of CPU time

– 98.7% solution estimated with Cross-Entropy

prototype on my laptop in 86 seconds
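
How a cross-entropy search over assignments might look can be sketched as follows. This is a generic illustration of the cross-entropy method applied to a tiny quadratic assignment instance, not the prototype mentioned above; the flow and distance matrices are made-up toy data, and the sample size, elite fraction, and smoothing factor are assumed values.

```python
import random
from itertools import permutations


def qap_cost(perm, flow, dist):
    """Cost of placing facility i at location perm[i]."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]] for i in range(n) for j in range(n))


def sample_perm(prob, rng):
    """Draw a permutation: each facility picks among the still-free locations."""
    free = list(range(len(prob)))
    perm = []
    for row in prob:
        weights = [row[loc] + 1e-12 for loc in free]  # keep weights positive
        loc = rng.choices(free, weights=weights)[0]
        free.remove(loc)
        perm.append(loc)
    return perm


def cross_entropy_qap(flow, dist, samples=200, elite_frac=0.1,
                      alpha=0.7, iterations=30, seed=0):
    rng = random.Random(seed)
    n = len(flow)
    prob = [[1.0 / n] * n for _ in range(n)]  # start from a uniform distribution
    best_perm, best_cost = None, float("inf")
    n_elite = max(1, int(samples * elite_frac))
    for _ in range(iterations):
        drawn = [sample_perm(prob, rng) for _ in range(samples)]
        scored = sorted(((qap_cost(p, flow, dist), p) for p in drawn),
                        key=lambda pair: pair[0])
        elite = scored[:n_elite]
        if elite[0][0] < best_cost:
            best_cost, best_perm = elite[0]  # best-so-far is always available
        # Shift the sampling distribution toward the elite samples.
        for i in range(n):
            for loc in range(n):
                freq = sum(1 for _, p in elite if p[i] == loc) / n_elite
                prob[i][loc] = alpha * freq + (1 - alpha) * prob[i][loc]
    return best_perm, best_cost


# Hypothetical 4-facility instance, small enough to brute-force for comparison.
flow = [[0, 3, 0, 2], [3, 0, 0, 1], [0, 0, 0, 4], [2, 1, 4, 0]]
dist = [[0, 22, 53, 53], [22, 0, 40, 62], [53, 40, 0, 55], [53, 62, 55, 0]]
best_perm, best_cost = cross_entropy_qap(flow, dist)
optimum = min(qap_cost(p, flow, dist) for p in permutations(range(4)))
```

Because the best assignment found so far is tracked on every iteration, the loop can be cut short at any point and still return an approximate solution, which matches the "stop at any time" property of the method.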



Other tools and why they matter

• OpenNLP

– An open source natural language processing tool that enables

• Detection of named entities (i.e. persons, places, organizations)

• Pronoun reference resolution

• Parsing

– Having rudimentary NLP capabilities would allow us to

rapidly build structured data sets from text corpora which

we need to test exploitation components.

– People want to see our algorithms operate on real data

and the harsh reality is there isn’t an abundance of

relevant RDF data out there to work with



Other tools and why they matter

• Mallet and GRMM (Graphical Models framework)

– An open source (U Mass) graphical models package which

allows us to stand up sizable Bayes nets and Markov nets

and do approximate inference

– These graph models play a central role in Knowledge

Representation (KR) and are helpful especially when it

comes to dealing with fuzziness and noise in the data

– Further, there is great interest in fusing graph models with

subsets of first-order logic. The field that deals with such

fusion is called Statistical Relational Learning. In order to

explore this field in any detail, it is essential to have a good

graphical models package.



Future Possibilities

• Other Application Domains

– Anomaly detection without a priori training datasets

– Large scale process compliance metrics

– Network attack detection

– Heuristic resource optimization (e.g. for ISR missions)

• Sample Research Topics

– Learning Processes from data

– Automated extraction of indicators from unstructured sources

(intel reports, open sources, imagery, MASINT, etc.)

– Highly-distributed Graph Analysis (using cloud architecture)

– Learning ontologies (relationships) from raw data



Summary

• Traditional strength in Semantic Web technology

– Data modeling

– Data integration

– Reasoning

• Expanding in Information Exploitation

– Semantic graph analysis, similarity, clustering

– Process matching

– Optimization

– … with focus on very fast approximations with large

scalability potential



Conclusion

• BBN offers a wealth of opportunities in Knowledge Engineering and Data Science, with applications relevant to many customer domains

• Our experience, versatility and success make us a valuable resource and an exciting research partner

Over 15 years of experience researching, developing & deploying Semantic Web

applications to DoD customers – a solid track record.

• We actively seek new members for our research teams:

– Arlington: Information Exploitation, Sensor Systems

– Columbia: Speech & Language, Cybersecurity

– Visit: careers.bbn.com


QUESTIONS?

Dr. Plamen Petrov

[email protected]

703-284-1299

