Raytheon BBN Technologies
From Knowledge Management to
Information Exploitation
Dr. Plamen Petrov
March, 2015
© 2015, Raytheon BBN Technologies
Topics
• Customer-driven direction
• Semantic Web technology & tools
– Query federation - Asio™
– Efficient storage - Parliament™ and SHARD
• Applications
– Data integration, co-reference resolution, provenance
– Geospatial and temporal reasoning
– Orchestrated intel collection
• Beyond Semantic Web
– Pattern detection
– Structural semantic similarity
– Clustering, approximate optimization, other tools
BBN provides a wealth of capabilities and tools paired
with deep expertise in Semantic Web and data
integration.
Motivation & New Challenges
• Strength: Historically, we are an applied Semantic Web
group specializing in data integration & knowledge mgmt
• Goal: Apply our expertise to a wider set of information
analysis problems
– Wide variety of graph-related problems
– Structured and unstructured data
– Larger datasets
• Reasons
– Emphasis of recent projects and new funding opportunities has
expanded beyond data integration and knowledge management
– Our group’s capabilities and potential have expanded beyond
the narrow area historically associated with the Semantic Web
Historic and Current Capabilities
[Figure: capability spectrum spanning Knowledge Management and Information Exploitation, with boxes for Acquisition, Extraction, Integration, Representation & Storage, Query, Reasoning, Index/Search, Similarity & Clustering, Visualization, Analysis of Alternatives, Prediction, and Synthesis. Overlaid technologies: Semantic Web (Asio™, Parliament™, SILK), SSDM, Cross-Entropy, S*QL, Cytoscape, ISSL/PINT.]
• Existing tools and algorithms to address a spectrum of needs
• Focus on scalability, reusability, and semantic graph analytics
Current Technologies Overlay
Customer-driven Intuition
• Our customers’ needs are expanding beyond the blue boxes
• We have developed expertise to contribute to these new areas
• Working Knowledge Management and Info Exploitation together
provides opportunities to distinguish us from competitors
[Figure: Knowledge Management / Information Exploitation capability spectrum.]
SEMANTIC WEB TECHNOLOGIES
Technology Overview
[Figure: Knowledge Management / Information Exploitation capability spectrum, highlighting Semantic Web technologies: Asio™, Parliament™, SILK.]
Key BBN-Led Semantic Initiatives
[Figure: timeline of initiatives, 2000–2012. Items shown:]
SemWebCentral.org
asio.bbn.com
W3C OWL Recommendation
FCG (AFRL/AMC)
NOTAMS (AFRL/AMC)
Horus (DARPA/IMO)
Combine/APSTARS
IEII (DARPA)
ICEWS (DARPA)
DAML Integration & Transition (DARPA)
JFP ACTD (JFCOM)
GARCON-F (NGA)
Geospatial SW (NGA)
Integrated Learning (DARPA)
DIESL (DARPA)
SASSI/MMON
COBRA
BBN Hosts
ISSL
PINT
Machine Reading (DARPA)
KAHT (DARPA)
RC2 (DARPA)
TASM (AFRL/AMC)
Omega (ONI)
SILK (commercial)
BCBS (commercial)
SHARD
Parliament™ Open Sourced
BBN Hosts SMW Conf.
DAML DB Parliament™
Asio™ Scout
SID REX
CyberOnt
Multi-INT Fusion (LM)
Experience Highlights
Data Integration
• Asio™, Parliament™
• JFP, COBRA, SID, ISSL…
Foundational Work
• DARPA DAML
• DAML DB
Reasoning (Temporal, Geospatial, Logic)
• Combine/APSTARS
• Geospatial SW
• GARCON-F
• SID
• SILK
Ontology Development
• Combine/APSTARS
• SID
• GARCON-F
• COBRA, CyberOnt
• RC2
Open System (Standards, Open Source)
• W3C
• OGC
• SemWebCentral
• Open Ontology Repository
NLP and Text Extraction
• Machine Reading
• Topic Detection
• Entity/Relation Extraction
Active Participation in SemWeb
• World Wide Web Consortium (W3C)
• Open Geospatial Consortium (OGC)
• US Geospatial Intelligence Foundation (USGIF)
• Terra Cognita workshops on Geospatial SemWeb
• Conferences and publications
– Semantic Web Programming book
– Organized ISWC 2009
– Presence/sponsorship at major SemWeb conferences
– Continuous involvement in STIDS since its inception
• Open Source Community
– Maintain SemWebCentral.org
– Contribute to open source projects
PARLIAMENT TRIPLE STORE
Technology Overview
[Figure: Knowledge Management / Information Exploitation capability spectrum, highlighting Parliament™.]
Parliament™ Triple Store
[ Parliament: n. A group of owls. ]
Overview
• RDF Triple Store with SPARQL query support
– First developed under DARPA DAML (DAML DB)
– In continuous use by customers for ~8 years
• Released as open source project in June 2009
– http://parliament.semwebcentral.com/
• Based on W3C standards for the Semantic Web
– RDF, RDFS, OWL, SPARQL
Differentiators
• Balanced query, insert, and space performance through a unique index structure
• Fast forward-chaining RDFS inference engine
• State-of-the-art optimization of complex queries
– Hundreds of triple patterns (equivalent to 30+ joins)
• Efficient temporal and geospatial queries without proprietary SPARQL extensions
• Support for efficient reification
Parliament – Layered Architecture
• Supports both Jena and Sesame
– Jena is highly customized to take advantage of indexing for Complex Queries
• Storage and Rule engine implemented in C++ for performance
[Figure: layered architecture. Java layer: Jetty + Joseki; Jena (SPARQL) with Jena Graph for Parliament and Jena Graph for Indexes (Named Graph Support, Temporal Index, Geospatial Index); Sesame (SeRQL) with Sesame SAIL for Parliament. The Java Native Interface connects to the C++ layer: Parliament and its Rule Engine, on top of Berkeley DB and the operating system. Legend: Third-Party Components, Parliament Components.]
Parliament Statement Table
Each entry (statement) contains:
• Three resource ID fields: Subject, predicate,
and object of the statement
• Three statement ID fields: Next statements
using the same resource as subject, predicate,
and object
• Bit-field flags encoding statement attributes
• Recently: a statement id field to support
reification
Parliament Resource Dictionary
Each entry (resource) contains:
• Bidirectional string-to-ID mapping
• Three statement ID fields: First statements
using this resource as subject, predicate, and
object
• Three count fields: Numbers of statements
using this resource as subject, predicate, and
object
• Bit-field flags encoding resource attributes
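The statement table and resource dictionary described on these two slides can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Parliament's C++ implementation: the class and method names (`TinyStore`, `add`, `match`) are hypothetical, bit-field flags and reification ids are omitted, and the choice to start from the shortest chain is an assumption about how the per-resource counts might be used.

```python
from dataclasses import dataclass, field

NONE = -1  # sentinel: "no statement"


@dataclass
class Resource:
    text: str
    # First statement using this resource as subject, predicate, object.
    first: list = field(default_factory=lambda: [NONE, NONE, NONE])
    # Number of statements using this resource in each position.
    count: list = field(default_factory=lambda: [0, 0, 0])


@dataclass
class Statement:
    ids: tuple   # (subject, predicate, object) resource ids
    next: list   # next statement sharing the same subject, predicate, object


class TinyStore:
    def __init__(self):
        self.resources, self.id_of, self.statements = [], {}, []

    def intern(self, text):
        # String-to-ID direction of the dictionary; resources[i].text is the reverse.
        if text not in self.id_of:
            self.id_of[text] = len(self.resources)
            self.resources.append(Resource(text))
        return self.id_of[text]

    def add(self, s, p, o):
        ids = tuple(self.intern(t) for t in (s, p, o))
        stmt_id, nxt = len(self.statements), []
        for pos, rid in enumerate(ids):
            res = self.resources[rid]
            nxt.append(res.first[pos])  # new statement becomes head of the chain
            res.first[pos] = stmt_id
            res.count[pos] += 1
        self.statements.append(Statement(ids, nxt))

    def _chain(self, rid, pos):
        sid = self.resources[rid].first[pos]
        while sid != NONE:
            yield sid
            sid = self.statements[sid].next[pos]

    def match(self, s=None, p=None, o=None):
        pattern = (s, p, o)
        if any(t is not None and t not in self.id_of for t in pattern):
            return []
        want = [None if t is None else self.id_of[t] for t in pattern]
        bound = [(self.resources[rid].count[pos], pos, rid)
                 for pos, rid in enumerate(want) if rid is not None]
        if bound:
            _, pos, rid = min(bound)  # walk the shortest chain (assumed heuristic)
            candidates = self._chain(rid, pos)
        else:
            candidates = range(len(self.statements))
        return [tuple(self.resources[i].text for i in self.statements[sid].ids)
                for sid in candidates
                if all(w is None or w == i
                       for w, i in zip(want, self.statements[sid].ids))]


store = TinyStore()
store.add("alice", "knows", "bob")
store.add("alice", "knows", "carol")
store.add("bob", "knows", "carol")
```

Insertion only touches the chain heads and counters, which is one way to read the claim that the indexes and size counts come almost for free.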
Parliament’s Index Structure
• Dynamic applications often require efficient statement insertion (as opposed to bulk loading)
• Goal: Balanced insertion, query performance, and storage space required
• Parliament stores triples using two components:
– Resource dictionary
– Statement table
• Parliament optimizes queries using:
– Subject, Predicate, and Object indexes and size counts
– These are maintained virtually for free
Parliament Performance
• Parliament maintains excellent query performance
for complex queries* while significantly increasing
throughput and decreasing space requirements
*Queries equivalent to 30+ SQL joins
• Current and future work includes:
– Parallelization to a cloud architecture: Hadoop/Accumulo
– Query optimization strategies in a distributed architecture
– Analysis of Parliament’s internal rule engine
– Further optimizations to the native storage structure
SHARD: Triple-Store Built on
Prioritized Goals
• Commodity hardware ONLY
• Highly scalable
• Decentralized computing
• Robust to node failures
Design Considerations For
• SPARQL
• Complex queries
• Large query responses
=> Distributed Query Optimization
Functional Overview of SHARD proof-of-concept
• Method calls at client
• Clause-iteration via Map-Reduce jobs
• Iterate over query clauses to find partial query matches
• Join partial query responses with flagged keys
• Move results to local machine for local drill-down
• Hadoop abstraction layer manages partial system failures with storage and computation redundancy
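The clause-iteration idea in the functional overview can be illustrated with a small in-memory stand-in. This sketch assumes nothing about SHARD's actual code: the "map" step is an ordinary loop over triples, the "reduce" step is a nested-loop join on shared variables, and the function names and sample data are hypothetical.

```python
def match_clause(clause, triple):
    """Map step: bindings if the triple matches the clause, else None."""
    binding = {}
    for term, value in zip(clause, triple):
        if term.startswith("?"):
            if binding.get(term, value) != value:
                return None
            binding[term] = value
        elif term != value:
            return None
    return binding


def join(partials, bindings):
    """Reduce step: keep combinations that agree on shared variables."""
    joined = []
    for left in partials:
        for right in bindings:
            if all(left[k] == v for k, v in right.items() if k in left):
                joined.append({**left, **right})
    return joined


def answer(query, triples):
    partials = [{}]  # one empty partial match to start from
    for clause in query:  # one pass (one job, in a real deployment) per clause
        bindings = []
        for triple in triples:
            binding = match_clause(clause, triple)
            if binding is not None:
                bindings.append(binding)
        partials = join(partials, bindings)
    return partials


triples = [
    ("alice", "advisor", "bob"),
    ("bob", "teaches", "cs101"),
    ("alice", "takes", "cs101"),
    ("dan", "advisor", "bob"),
    ("dan", "takes", "math1"),
]
# A triangular query: students taking a course taught by their own advisor.
triangle = [("?s", "advisor", "?p"), ("?p", "teaches", "?c"), ("?s", "takes", "?c")]
matches = answer(triangle, triples)
```

Each pass narrows the set of partial matches, which is why the number of clauses, rather than the data size alone, drives the number of jobs in this style of evaluation.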
SHARD v1 Benchmarking
Query Type                                  SHARD v01              Parliament + Sesame*   Parliament + Jena*
Simple Query, Small Result Set (Query 1)    404 sec. (~0.1 hr.)    0.1 hr                 0.001 hr
Triangular Query (Query 9)                  740 sec. (~0.2 hr.)    1 hr                   1 hr
Simple Query, Large Result Set (Query 14)   118 sec. (~0.03 hr.)   1 hr                   5 hr
*K. Rohloff, M. Dean, I. Emmons, D. Ryder, J. Sumner. “An Evaluation of Triple-Store Technologies for
Large Data Stores.” 3rd International Workshop On Scalable Semantic Web Knowledge Base
Systems (SSWS '07), 2007.
• Deployed code on Amazon EC2 cloud.
– 19 XL nodes.
• LUBM (Lehigh Univ. BenchMark)
– Artificial data on students, professors, courses, etc… at universities
• 800 million edge graph, 6000 LUBM university dataset
• In general, performed favorably to “industrial” monolithic triple-stores
Key Design Issues, From Experience
• Data partitioning is key
– Standard hash partitioning limits scalable performance
because it makes decentralized processing almost
impossible
– Decentralized processing more efficient if related data
is colocated
– There are provenance and fault tolerance issues
• Indexing vs. full data scans
– Lack of indexing enables more efficient data ingest and
scans, but severely limits look-ups
– Need to be sensitive to applications
– Possible win through partial distributed indexing
Future Triple Store Requirements
• Customer needs
– Highly scalable and cost-effective graph data storage and
efficient inferencing/querying
– Operate over standard infrastructure such as Hadoop/HDFS,
interface with Accumulo/CloudBase
– Use through standards-based APIs: SPARQL, REST, SOAP
– Comply with Jena, Sesame, or BlackBook
– Scalable to very large graphs, yet agile for high-throughput
• Goals for next generation of Parliament and SHARD
– Redesign Parliament’s optimizations for a cloud-based
architecture to support customer mission objectives
– Support large graph data ingests, inferencing, and efficient
querying on commodity Hadoop/Accumulo deployment
SEMANTIC QUERY FEDERATION
Technology Overview
[Figure: Knowledge Management / Information Exploitation capability spectrum, highlighting Asio™.]
Asio™ Tool Suite
Overview
A set of runtime software and development tools designed to address the requirement for robust data federation across disparate sources. Based on World Wide Web Consortium (W3C) industry standards, the Asio™ tool suite allows systems and users to reach a shared understanding about the meaning, structure, and context of the data exchanged.
Differentiators
• A very expressive mapping language gives freedom in connecting data sources to domain ontology
• Advanced execution engine delivers query results quickly
• Workbench components significantly reduce configuration time and effort
• Asio™ has been in active development since 2006; v.1 in 2007
http://asio.bbn.com/
Architectural View
[Figure: Asio™ architecture. A SPARQL query posed against the Domain Ontology (OWL) enters Semantic Query Decomposition (SQD). Mapping Ontologies (OWL, SWRL Rules) relate the Domain Ontology to Data Source Ontologies (OWL). Semantic Bridges sit in front of each source: Relational Database (SQL to an RDBMS), Web Service (REST/SOAP, described by WSDL), and SPARQL Endpoint (SPARQL to an RDF Triple Store). Asio™ Workbench tools shown: Snoggle, Automapper.]
Asio™ Federated Query
[Figure: a federated query flowing through Semantic Query Decomposition (SQD) and four Semantic Bridges (two Relational Database bridges for RDBMS One and RDBMS Two, one Web Service bridge, one SPARQL Endpoint bridge). Numbered steps 1 to 6 are labeled: SPARQL Query; Query Decomposition; Backwards Rule Chaining; Generation of Sub Queries (3); Data Access (4); Query Result Set.]
Asio™ Strengths
• Very expressive mapping language (SWRL with functional extensions)
– Allows significantly more mapping power than just OWL mappings (equivalentClass, subClass, equivalentProperty, etc.)
– Truly independent domain ontology
– Required for real data unification
• Efficient streaming result sets
– UIs need answers quickly
– Lower memory requirements mean more concurrent users
• Advanced execution engine
– Ontology reasoning for reduced intermediate data set sizes
– SWRL builtin to SQL function converter pushes filtering to low levels
– Configurable batch size to handle higher latency situations
Why Mapping Language Matters
• Real world denormalized database schemas do
not match well with ontologies:
ID  Name           Dept Name  Dept Location
1   John Smith     Finance    B1
2   Jane Doe       Contracts  B1
3   Steve Stevens  IT         B2
4   Tom Thomas     IT         B2
[Figure: target ontology linking Employee, Department, and Building; the building is identified by a string value]
This mapping requires creating a new entity, Building, from the Person table row.
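As a rough illustration of this first case, the sketch below turns each denormalized row into triples for three entities. Asio™ expresses such mappings in SWRL; Python is used here only to show the shape of the transformation, and the identifiers (`emp:`, `dept:`, `bldg:` prefixes, property names such as `memberOf` and `locatedIn`) are hypothetical.

```python
def map_employee_row(row):
    """One denormalized row yields Employee, Department, and Building entities."""
    emp = f"emp:{row['id']}"
    dept = "dept:" + row["dept_name"]
    bldg = "bldg:" + row["dept_location"]  # new entity minted from a string column
    return {
        (emp, "rdf:type", "Employee"),
        (emp, "hasName", row["name"]),
        (emp, "memberOf", dept),
        (dept, "rdf:type", "Department"),
        (dept, "locatedIn", bldg),
        (bldg, "rdf:type", "Building"),
    }


rows = [
    {"id": 1, "name": "John Smith", "dept_name": "Finance", "dept_location": "B1"},
    {"id": 2, "name": "Jane Doe", "dept_name": "Contracts", "dept_location": "B1"},
    {"id": 3, "name": "Steve Stevens", "dept_name": "IT", "dept_location": "B2"},
    {"id": 4, "name": "Tom Thomas", "dept_name": "IT", "dept_location": "B2"},
]
triples = set().union(*(map_employee_row(r) for r in rows))
buildings = {s for s, p, o in triples if p == "rdf:type" and o == "Building"}
```

Because the building identifier is derived from the column value, the four rows produce only two Building entities.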
Why Mapping Language Matters
• Sometimes a good schema can be stretched too
far:
ID Name Manager
1 John Smith Stephanie Fox
2 Jane Doe John Smith; Tom Thomas
3 Steve Stevens John Smith; Frank Foster
Jane Doe
John Smith
Tom Thomas
This mapping requires splitting the “manager” column into multiple parts by invoking procedural extensions.
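A procedural extension for this case might do little more than split the column and emit one statement per part. The sketch below assumes a semicolon separator, as in the sample rows; the function and property names are hypothetical.

```python
def map_managers(row, separator=";"):
    """Emit one manager statement per name packed into the column."""
    employee = f"emp:{row['id']}"
    names = [part.strip() for part in row["manager"].split(separator)]
    return [(employee, "hasManagerName", name) for name in names if name]


jane = map_managers({"id": 2, "name": "Jane Doe", "manager": "John Smith; Tom Thomas"})
```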
Why Mapping Language Matters
• Even when the schema is reasonable, it may not
be the way the domain should be modeled:
ID Name Title
1 John Smith Vice President
2 Jane Doe Director
3 Steve Stevens Director
[Figure: class hierarchy relating Employee, Manager, Director, and Vice President via subclassOf]
The type of the entity in the row must be mapped from a string value in a column.
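A sketch of this third case: the Title string selects the class of the entity. The exact hierarchy is an assumption (Director and Vice President under Manager, Manager under Employee), and the class and function names are hypothetical.

```python
# Assumed lookup from column value to ontology class.
TITLE_TO_CLASS = {"Vice President": "VicePresident", "Director": "Director"}
# Assumed subclassOf chain.
SUBCLASS_OF = {"VicePresident": "Manager", "Director": "Manager", "Manager": "Employee"}


def types_for_row(row):
    """Return rdf:type statements for the mapped class and its superclasses."""
    cls = TITLE_TO_CLASS[row["title"]]
    triples = []
    while cls is not None:
        triples.append((f"emp:{row['id']}", "rdf:type", cls))
        cls = SUBCLASS_OF.get(cls)
    return triples


vp_types = types_for_row({"id": 1, "name": "John Smith", "title": "Vice President"})
```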
Asio™ “Streaming” Result Sets
• Forming federated query results can be
complicated, and require many intermediate
queries
• User Interfaces still need to start showing
progress quickly, even if there are hundreds or
thousands of results
Solution: filter and process individual results on the
fly, and start returning them immediately
* Some queries, like those that involve DISTINCT, COUNT, or ORDER
BY, will still need to be fully processed before returning an answer
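The difference between streaming and blocking operators can be shown with Python generators. This is a sketch of the general idea only, not of the Asio™ engine; `source`, `federate`, `keep`, and `order_by` are hypothetical names, and the log exists purely to show when each source is actually read.

```python
log = []


def source(name, rows):
    """Stand-in for a data source that produces rows one at a time."""
    for row in rows:
        log.append(f"{name} produced {row}")
        yield row


def federate(sources):
    """Streaming: pass rows through as soon as a source yields them."""
    for src in sources:
        yield from src


def keep(rows, predicate):
    """Streaming filter: decides row by row, holds nothing back."""
    return (row for row in rows if predicate(row))


def order_by(rows):
    """Blocking: sorting needs every row before the first can be returned."""
    return sorted(rows)


stream = keep(federate([source("db1", [3, 8]), source("db2", [5, 1])]),
              lambda r: r > 2)
first = next(stream)          # only one row has been read at this point
log_after_first = list(log)
rest = list(stream)           # drains the remaining rows
```

After the first `next`, only one row has been pulled from the first source; an `ORDER BY` wrapper would have forced all four reads before returning anything.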
Asio™ “Streaming” Result Sets
[Figure: animation frames in which rows of a result set (A, B, C, D, …) arrive incrementally.]
Streaming result sets start feeding query results to the user interface immediately.
Asio™ Advanced Execution Engine
• Query Rewriting creates Efficient Subqueries
– Ontology Reasoning
Disjoint Classes
• Liberal use of disjointness statements in the ontology helps to reduce generated UNIONs in certain domain ontology situations
• Pairwise disjointness can be asserted automatically for some data source ontologies
Functional / Inverse Functional Properties
• Many unbound variables introduced in the rule expansion stage are unified
– SWRL Builtin <-> SPARQL Filter <-> SQL WHERE equivalence
• Configurable for various data source distributions
– Streaming block size can be changed for different latency situations
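The builtin/filter/WHERE equivalence can be illustrated for simple comparisons. The sketch below is not Asio™'s converter: it covers only a handful of SWRL comparison builtins with numeric values, the function names are hypothetical, and the `emp` table and `salary` column are made up for the demonstration.

```python
import sqlite3

# SWRL comparison builtins and the operator they share with SPARQL and SQL.
OPS = {
    "swrlb:equal": "=",
    "swrlb:lessThan": "<",
    "swrlb:lessThanOrEqual": "<=",
    "swrlb:greaterThan": ">",
    "swrlb:greaterThanOrEqual": ">=",
}


def to_sparql_filter(builtin, variable, value):
    """Numeric values only in this sketch."""
    return f"FILTER (?{variable} {OPS[builtin]} {value})"


def to_sql_where(builtin, column, value):
    # The column name must come from a trusted mapping; the value is a parameter.
    return f"WHERE {column} {OPS[builtin]} ?", (value,)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [("a", 40000), ("b", 60000)])

where, params = to_sql_where("swrlb:greaterThan", "salary", 50000)
rows = conn.execute(f"SELECT name FROM emp {where}", params).fetchall()
```

Pushing the comparison into the `WHERE` clause means the database returns only matching rows, so less data crosses the bridge.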
Asio™ - Other Features
• Subclass/Subproperty reasoning
– This creates more possibilities for inferring query statements (in the way that you would expect)
• No need to do mappings on a per-query basis
– Mappings are written from data source to domain
– If a data source is removed, mappings written for other sources remain unchanged
• Multiple User Perspectives and Access Control
– Distinct sets of mapping rules are defined on a per-user or per-group basis
– This allows different users or groups to have different perspectives on the same data
– Provides access control through available mappings
Asio™ Workbench
• Workbench significantly reduces time and effort for Asio™ configuration
– Works within Eclipse development environment
– Second generation implementation with Wizards and XML views
– Being deployed with Asio™ for Raytheon-internal data management
Integrated management of
data sources
Source code view allows editing of XML
Specialized ASIO perspective
Configuration file access
DB Connection Wizard
Test Connection Immediately
Asio™ Performance
• Preprocessing of a query takes milliseconds
• “Streaming” results means that you start getting query
answers back quickly, even if there are many results
– This improved our performance by two orders of magnitude
• The processing has very little overhead over just executing
the queries
• Many stateless instances can run on the same machine
Needs Beyond a Stateless Architecture
• Caching results
– The source data may not be always reachable
– Caching should improve performance
• Storing metadata
– For co-reference resolution
– For provenance
– Other “persisting” annotations
• Medium- to High- performance Triple Stores
needed
– Depending on implementation strategy
– Query performance optimization may be more
valuable than sheer volume of data stored
DATA INTEGRATION,
CO-REFERENCE RESOLUTION
Technology Applications
Data Integration Examples: SID & ISSL
• Semantically Integrated Databases (SID)
– Challenge: data for risk analysis scattered among five different
RDBs
– Solution:
• Semantic alignment of data using Asio™
• Non-destructive co-reference resolution of instance data
• Federated data tagged with provenance & time-interval of validity
• Integrated Semantic Search Layer (ISSL)
– Challenge: many disparate RDBs contain the data necessary to
correlate and identify activities of interest
– Solution:
• Semantic alignment of data using Asio™ to a domain ontology
• Integration between RDBMS and RDF data sources
• Streaming federated queries across all databases, while
maintaining provenance information
Data Integration Examples - COBRA
• Collaborative Ontology-Based Reasoning Architecture
– Create a unified common data environment
– Integrate disparate data sources; use “analysis case” as a container
– Identify and resolve co-references using rules, recommendations & human experts
– Provide flexible data mapping to a domain ontology & a shared knowledge base
– Data is queried and used in multiple analytical views via application plugins
– Ontology drives the User Interface
[Figure: COBRA data flow]
Data sources: Flat Files, RDBMS, INT reports, Web Service, Live Feeds
Data Integration
• Convert to OWL/RDF
• Map to common Domain Ontology
• Track provenance
• Result: Harmonized and integrated data expressed in the end-users’ vocabulary
• Rapid integration of analysis teams’ ad hoc files
Co-Reference Resolution
• Identify duplicates
• Choose versions of attribute values
• Result: Clean data, ready for analysis
• Non-destructive merge and unmerge
Knowledge Base: Common Data Environment
Plug-ins: COTS Analysis Tools
Data Ingest & Deconfliction UIs
• Although we focus mainly on back-end components,
we have developed, deployed, and integrated with
many sophisticated user interfaces
– Emphasis on usability, working closely with customers
– Experience with multiple UI toolkits, thick & thin client
Provenance
• Types of Provenance
– What data source did the data come from?
– Who modified/processed it? And how?
– What rule created it?
– Where (in the text) was it extracted from?
– …
• Common approaches
– Reification (efficient implementation in Parliament)
– Named Graphs
• Experience in a number of projects
– COBRA – Data source tracking, user belief in a fact
– SID – Time value for which a statement was known to be true
– Machine Reading – Textual provenance, rule provenance
[Figure: the statement Person1 hasName “E.J. Blatt”, annotated with source “HR Records” and confidence “High”]
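The two common approaches can be contrasted on the slide's example statement. The sketch below uses plain Python tuples rather than an RDF library; the reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`) is standard RDF, while the `source` and `confidence` property names, the blank-node label, and the graph name are hypothetical.

```python
def reify(stmt_id, s, p, o, **annotations):
    """Reification: describe the statement itself, then annotate that description."""
    triples = [
        (stmt_id, "rdf:type", "rdf:Statement"),
        (stmt_id, "rdf:subject", s),
        (stmt_id, "rdf:predicate", p),
        (stmt_id, "rdf:object", o),
    ]
    triples += [(stmt_id, key, value) for key, value in annotations.items()]
    return triples


def in_named_graph(graph, s, p, o, **annotations):
    """Named graphs: put the statement in a graph and annotate the graph."""
    quads = [(s, p, o, graph)]
    quads += [(graph, key, value, "graph:meta") for key, value in annotations.items()]
    return quads


reified = reify("_:st1", "Person1", "hasName", "E.J. Blatt",
                source="HR Records", confidence="High")
quads = in_named_graph("graph:hr", "Person1", "hasName", "E.J. Blatt",
                       source="HR Records", confidence="High")
```

Reification costs four extra triples per annotated statement, which is why an efficient store-level implementation matters; named graphs amortize the annotation over every statement in the graph.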
The Co-reference Resolution Problem
• Data from multiple sources is bound to have
– Multiple data elements/values referring to the same physical entity
– This may be caused by
• Inconsistent or incomplete data (e.g. from manual entry errors)
• Bad schemas, or misuse of the schemas
• Inconsistent schema or ontology alignment across data sources
• Inconsistent value conventions across data sources
– Querying across such sources becomes difficult
• The problem is common: COBRA, SID, ISSL/PINT…
• The solution requires Co-reference Resolution
– Identify which data elements refer to the same physical item
– Link (or un-link) data records that may (or may not) refer to the same
item
– These processes should be reversible, in the face of new evidence
[Figure: two Person nodes with hasName values “E.J. Blatt” and “Eric Blatt”; do they refer to the same person?]
Co-Reference Resolution
• Approaches to merging entities
– Ontological modeling and mapping (along with data value conventions)
– Rules (with procedural extensions) for integrity constraints and matching
– Algorithmic “Distance” Metrics
– Structural similarity
– Learning thresholds for classes of similar objects
• A Framework for Co-Reference Resolution
– Similarity functions: textual, conceptual, spatial, etc.
– Run-time evaluation
– Automated or semi-automated merge decisions
– Non-destructive merge/split
• Examples: COBRA (complete framework), PINT (similarity funct.)
– Analysts liked the option for multiple similarity scores
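A minimal sketch of such a framework follows, assuming a textual similarity function from the Python standard library and a link table that records merges without altering the underlying records. It is not the COBRA or PINT implementation; `name_similarity`, `CorefLinks`, and the record ids are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Textual similarity in [0, 1]; one of several possible similarity functions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class CorefLinks:
    """Non-destructive merge: records stay intact, only links are added or removed."""

    def __init__(self):
        self.links = set()

    def merge(self, a, b):
        self.links.add(frozenset((a, b)))

    def split(self, a, b):
        self.links.discard(frozenset((a, b)))

    def cluster(self, entity):
        """All records currently considered the same entity as `entity`."""
        seen, todo = {entity}, [entity]
        while todo:
            current = todo.pop()
            for link in self.links:
                if current in link:
                    for other in link - seen:
                        seen.add(other)
                        todo.append(other)
        return seen


score = name_similarity("E.J. Blatt", "Eric Blatt")
links = CorefLinks()
links.merge("person:1", "person:2")
merged = links.cluster("person:1")
links.split("person:1", "person:2")   # reversible in the face of new evidence
after_split = links.cluster("person:1")
```

Whether a given score triggers an automatic merge or a recommendation to an analyst is a policy decision outside this sketch.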
[Figure: two Person nodes, named “E.J. Blatt” and “Eric Blatt”, each with an SSN; are they similar?]
Geospatial and Temporal Reasoning
• Applicable to a wide variety of problem domains:
Logistics, HR, Readiness, C2, Intel, etc.
• Temporal Reasoning
– Show me all employees who had access to Lab 47B
between January and June.
– At which locations is dwell time greater than 2 years?
– How long, on average, are O4s and below deployed
in Afghanistan?
• Geospatial Reasoning
– Show me all officers stationed within 200km of Kabul.
– What is the average pay of GIs within combat zone B3?
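The two kinds of question reduce to simple primitives: a distance test and an interval-overlap test. The sketch below uses the haversine great-circle formula and closed date intervals; in a triple store these checks would normally be served by geospatial and temporal indexes rather than computed row by row. The records, coordinates (approximate), and year are illustrative assumptions.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def overlaps(start_a, end_a, start_b, end_b):
    """True if two closed intervals share at least one day."""
    return start_a <= end_b and start_b <= end_a


KABUL = (34.5553, 69.2075)  # approximate coordinates
stations = {                # hypothetical records, approximate coordinates
    "officer-1": (34.9461, 69.2650),
    "officer-2": (31.6289, 65.7372),
}
within_200km = [who for who, (lat, lon) in stations.items()
                if haversine_km(lat, lon, *KABUL) <= 200]

# "Had access between January and June" as an interval-overlap test.
had_access = overlaps(date(2014, 3, 1), date(2014, 4, 1),
                      date(2014, 1, 1), date(2014, 6, 30))
```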
Geospatial and Temporal Reasoning
Geospatial Analysis & Reporting Conceptual Framework (GARCON-F)
– Challenge: NGA analysts are limited by exhaustive keyword search
through text reports or “flat” tags
– Solution:
• Conceptual Search via mapped ontologies for richer information
exploitation
• Provide Ontological, Geospatial, and Temporal reasoning primitives
• Enable reasoning dependent on context (e.g. “show threats to route”)
– Example query from GARCON: Highlight contextual reasoning
– Used Parliament’s geospatial & temporal indexes for efficiency
Semantically Integrated Databases (SID)
– Provided a unique “snapshot-in-time view” of the Federated Query
result set known to be valid at a particular point in time
– Used Parliament’s temporal index and reasoning primitives
Intel Collection Orchestration Loop
[Figure: Orchestrated Intel Collection loop. Labels: Information Needs Planning; Object Identification, Scoring & Prioritization; Post-Processing; Information Need and Collection Strategy Management; Information Assessment; Data Processing & Fusion; Distributed & Autonomous Collection Operations; Problem Statement; Problem Decomposition; Collection Strategy Planning; Observation Methods Identification, Scoring & Prioritization; Task Allocation; Task Valuation per Collection Method per Opportunity; Selection; Analyze Existing Intel; Collection Assessment.]
[Figure: intel collection workflow. Stage labels: Problem Statement; Problem Decomposition; Observation Needs and Collection Strategy Derivation; Task Prioritization using Intel Value Calculus; Scheduling & Collection; Mission Execution. Data labels: Intelligence Problem Statement; Prioritized Information Need; Prioritized Observation Need; Prioritized Observation Method; Observation Opportunity; Observation Probability, Benefit & Cost; Processing/Fusion Quality Probability; Analyst Customer; Intel Product & SA. Loop labels: Problem Decomposition; Orchestrated Collection; Fusion & Feedback; Mission Assessment; Scheduling & Collection. Annotations: [Prospective Sensemaking], [Retrospective Sensemaking], [Analysis of Alternatives], [Cognitive Dynamic Adaptation].]
Proposed System Architecture
INFORMATION EXPLOITATION
Technology Overview
[Figure: Knowledge Management / Information Exploitation capability spectrum.]
SILK: SEMANTIC REASONING
Technology Overview
[Figure: Knowledge Management / Information Exploitation capability spectrum, highlighting SILK.]
SILK: Semantic Inferencing on Large Knowledge
Challenge:
Expressiveness sufficient to efficiently represent and reason over complex processes and policies, including defaults and conflict resolution
Solution:
Layered architecture combining standards (RIF) and research results from the Semantic Web and logic programming communities
Results:
Rules language, UI, and inference engine used to model complex policy & process rules & support other applications
Customer:
Vulcan Inc.
Project Halo
http://silk.semwebcentral.org
SILK Makes the Enterprise Model Simpler
SILK offers expressive power not found in OWL or RIF
• Powerful conflict
detection and resolution
• Expressive reasoning
about time, processes,
and change
• Hierarchies of exception
cases
“Red Division uses the standard payroll rules except they use their own rules for check cutting.”
“Beginning in July, consulting rates will increase 5%. All other pricing data remains unchanged.”
“Policy A, requiring a purchase order to pay bills, conflicts with policy B, which requires payment within 15 days.”
Taction2 = Taction1 + 5
≠
ISSL/PINT
Project Overview
[Figure: Knowledge Management / Information Exploitation capability spectrum, highlighting ISSL/PINT.]
The Problem Space
• Goal
– Detect and disrupt the terrorist networks manufacturing and
emplacing IEDs
• Requirements
– Understand and encode Red Nodal Processes for doing bad things
– Search through incoming intelligence artifacts for Observations that
can be used as evidence of a Red Nodal Process in action
• Terminology
– Red Nodal Process: a documented process by which an enemy organization achieves some goal (e.g. manufacture of a car bomb)
– Red Nodal Reference Model: a model that contains all documented Red Nodal Processes and their relationships
– Observable: something visible to the human eye or to sensors (e.g. Blue Barrel)
– Observation: an Observable instance, seen at a specific time/place
PINT Challenge – The Answer
PINT Challenge – The Reality
What Makes it Hard
• Red Nodal Processes can change
– Process structure changes
– Observables change
• Activities are not observed, only their indicators
– One piece of evidence can indicate multiple potential Activities
• Most activities within a Process have benign corollaries
– There are no smoking guns to find
– You must find multiple indicators of a Process that occur in the
appropriate order and within an appropriate temporal and spatial
region
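To make the last point concrete, the sketch below scores how well a set of time-stamped observations supports one path through a process, requiring the matched indicators to appear in order and within a maximum time span. It is a simplified greedy illustration, not PINT's algorithm: spatial constraints are ignored, and the function name, observation fields, and sample data (borrowed from a sandwich-making process) are hypothetical.

```python
def score_path(path, observations, max_span):
    """Fraction of path activities supported by in-order observations."""
    obs = sorted(observations, key=lambda o: o["time"])
    matched, start, first_time = 0, 0, None
    for activity in path:
        for k in range(start, len(obs)):
            o = obs[k]
            in_window = first_time is None or o["time"] - first_time <= max_span
            if activity in o["indicates"] and in_window:
                matched += 1
                if first_time is None:
                    first_time = o["time"]
                start = k + 1  # later activities must use later observations
                break
    return matched / len(path)


path = ["Get Recipe", "Buy Eggs", "Buy Cheese", "Buy Bagel",
        "Make Sandwich", "Eat Sandwich"]
observations = [
    # One observation can indicate several candidate activities.
    {"time": 2, "what": "grocery receipt", "indicates": {"Buy Eggs", "Buy Cheese"}},
    {"time": 0, "what": "cookbook", "indicates": {"Get Recipe"}},
    {"time": 5, "what": "kitchen in use", "indicates": {"Make Sandwich"}},
]
loose = score_path(path, observations, max_span=24)
tight = score_path(path, observations, max_span=4)
```

Tightening the window drops the late observation, so the same evidence supports the path less strongly.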
Example Process
Get Recipe
Buy Eggs
Buy Cheese
Buy Bagel
Buy Muffin
Make Sandwich
Eat Sandwich
• 7 Activities • 1 Decision • 1 Fork
Indicators
• Cookbook
• Watching Food Network
• Calling Mom
Example Process - More Realistic
• 49 Activities • 14 Decisions • 3 Forks
Identified Path
PINT finds the best fit path within a process
ISSL/PINT
[Figure: ISSL/PINT overview. Labels: HUMINT, GEOINT, SIGINT, Asio™, ISSL, Reference Model, SME, Case File, Analyst, PINT Process Detection, Detected Process.]
How it works – Process Multiplexer
Processes Observations Processes Observations
How it works – Process Finder
[Figure: Bindings are correlated using Heuristic Affinity Values, yielding Correlated Bindings.]
How it works – Process Finder
[Figure: a Cluster of Bindings is mapped to a Process (nodes A, B, C, D) to give the Detected Solution; Score = 0.91.]
PINT Results
• Results
– Finds answers in a few seconds
• Competitor’s system requires overnight processing to search
for a single Process
– Matches on a single path in the process
– Automatically adapts to changes in the model
• Next Steps for ISSL/PINT
– Begin automating the link between ISSL and PINT
• E.g. Reduce analyst workload by (semi-)automatically
detecting Observables in unstructured intelligence (reports,
signatures, imagery, etc)
• E.g. Learn correlation of terms/indicators in text reports
SSDM, Clustering,
Cross-Entropy, Other Tools
Tech Overview
[Figure: Knowledge Management / Information Exploitation capability spectrum, highlighting SSDM, Cross-Entropy, S*QL, Cytoscape, and PINT Clustering.]
SSDM - What is it?
• Structural Semantic Distance Metric (SSDM)
– What does it do?
• Looks at structural components of labeled graphs to
determine similarity between nodes.
• This is related to Google’s “similar pages” feature except that
it attempts to account for the semantics of edges
• Intuition: Nodes are similar if related to (other) similar nodes
– Why is it useful?
• Often, customer data naturally represents (or can be
massaged into) a labeled graph representation.
• While mechanisms exist for querying such graphs, they are
brittle and require intimate knowledge of the graph's
composition and vocabulary.
Structural Similarity - Examples
[Figure: example graph on nodes s, t, u, v, x]
• sim(x,x) = 1
• sim(s,t) > 0
• sim(u,v) > 0
• sim(u,v) < sim(s,t)
• sim(u,t) = 0
[Figure: two labeled graphs. The first (nodes s, t, u, v, x) has edges stars in, writes, has genre, has genre; the second (nodes s, t, w, x, y) has edges stars in, stars in, has genre, has genre]
• sim(u,v) > sim(w,y)
• In a directed graph, nodes are similar if connected to other similar nodes
– SimRank algorithm
– pSimRank fast approximation
– Similarity computed by converging random walks
• SSDM extends pSimRank with the semantics of labeled edges
– Random walks are only allowed to meet if they traverse the same sequence of labels
– “Obscure” or “obvious” edges can be favored
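The idea that labeled edges restrict which neighbors may contribute to similarity can be illustrated with a small sketch. This is not BBN's SSDM implementation and not pSimRank's random-walk approximation: it is a plain iterative SimRank over incoming edges, with the assumed simplification that a pair of neighbors counts only when both edges carry the same label. The function name `labeled_simrank`, the `decay` parameter, and the toy triples are hypothetical.

```python
from itertools import product

def labeled_simrank(edges, decay=0.8, iters=5):
    """Label-aware SimRank sketch. edges: iterable of (source, label, target)."""
    edges = list(edges)
    nodes = sorted({n for s, _, t in edges for n in (s, t)})
    incoming = {n: [] for n in nodes}
    for s, label, t in edges:
        incoming[t].append((label, s))
    # Start with each node similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iters):
        new = {}
        for a, b in product(nodes, repeat=2):
            ia, ib = incoming[a], incoming[b]
            if a == b:
                new[(a, b)] = 1.0
            elif not ia or not ib:
                new[(a, b)] = 0.0
            else:
                # Assumption: neighbor pairs contribute only if edge labels match.
                total = sum(sim[(x, y)]
                            for la, x in ia for lb, y in ib if la == lb)
                new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim

# Hypothetical toy data: one person linked to three works.
triples = [
    ("hamill", "stars in", "star_wars"),
    ("hamill", "stars in", "empire"),
    ("hamill", "writes", "memoir"),
]
scores = labeled_simrank(triples)
```

In this toy graph the two works reached by "stars in" edges score 0.8 against each other, while the work reached by a "writes" edge scores 0 against both, which an unlabeled SimRank would not distinguish.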
Semantic Similarity – a Test Case
• The Internet Movie Database contains a wealth
of information about films, actors, directors,
producers, etc., and their interactions
with one another.
• RDF: on the order of 100K nodes and 1,000K edges
• Say we are interested in obtaining a list of
movies similar to Star Wars…
Querying for “Star Wars”
“Star Wars Gangsta Rap”
Really?!
Nodes Similar to <Star Wars>
Using SSDM yields more intuitively correct results, e.g. other movies
directed by George Lucas or starring Mark Hamill (Luke Skywalker)
Beyond Just Similarity
• Minor tweaks allow for similarity explanations
• Can tune similarity scores to favor:
– Obvious explanations (e.g. Director=George Lucas)
– Obscure explanations (e.g. Editor=T.M. Christopher)
• Very scalable and highly parallelizable algorithm
– Similar to Google’s PageRank
• Several further optimizations could make the
system even faster
Clustering
• What does it do?
– Determines groupings of ‘nearby’ items in large sets of
data under some salient measure of proximity
• Why is it useful?
– Summarization
• E.g., semi-supervised learning of ontologies
– Outlier detection
• Useful for anomaly detection systems
– Linkages between groups
• Interlinks between clusters are often interesting. For instance, they can provide insight into how information is shared among otherwise disparate groups.
Clustering - Applications
[Figure panels: Interesting Connections and Outliers (callouts: Relationships between groups; Outlier); Semi-Supervised Taxonomy Learning]
Clustering – How it works
• Uses state-of-the-art spectral clustering for
high-dimensional data (working prototype).
• Tested on 197,000-dimensional data sets (Movie
Database)
• Modular implementation:
– A Java-based k-means clusterer; can plug in others
– A Java-based information-theoretic heuristic for
discovering the number of clusters
– Clustering can use SSDM as its distance metric
– …or other similarity metrics
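As a rough illustration of spectral clustering over a pluggable similarity measure, the sketch below splits items into two groups from a precomputed similarity matrix. It is an assumption-laden toy, not the Java prototype described above: it uses Python, handles only two clusters, and replaces the k-means step with a sign split on the second eigenvector of a lazy random walk, found by power iteration. The name `spectral_bisect` and the toy matrix are hypothetical; any symmetric, non-negative similarity (SSDM scores, for instance) could supply the matrix.

```python
def spectral_bisect(sim, iters=200):
    """Two-way spectral split of a symmetric, non-negative similarity matrix."""
    n = len(sim)
    deg = [sum(row) for row in sim]      # assumes every row sum is > 0
    total = sum(deg)
    v = [float(i) for i in range(n)]     # deterministic, non-constant start
    for _ in range(iters):
        # Project out the trivial constant eigenvector (degree-weighted mean).
        mean = sum(d * x for d, x in zip(deg, v)) / total
        v = [x - mean for x in v]
        # One step of the lazy random walk (I + D^-1 W) / 2.
        walked = [sum(w * x for w, x in zip(sim[i], v)) / deg[i]
                  for i in range(n)]
        v = [0.5 * (x + y) for x, y in zip(v, walked)]
        # Rescale so the vector neither overflows nor vanishes.
        scale = max(abs(x) for x in v) or 1.0
        v = [x / scale for x in v]
    # The sign of each component picks the cluster.
    return [1 if x > 0 else 0 for x in v]

# Hypothetical toy data: items 0-2 and 3-5 form two tight groups.
similarity = [[1.0 if (i < 3) == (j < 3) else 0.01 for j in range(6)]
              for i in range(6)]
labels = spectral_bisect(similarity)
```

A fuller pipeline would keep several eigenvectors and run k-means on them, with the number of clusters chosen by a separate heuristic.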
Cross-Entropy
• What does it do?
– Theoretically sound approach to tackling nasty
optimization problems … approximately
• Why is it useful?
– General purpose
• Applicable to a large set of graph-related problems (and other
problems)
– Only a few parameters to configure
– Fast and iterative: you can stop at any time and get
an approximate solution
Cross-Entropy Application –
Quadratic Assignment
• NP-Hard optimization problem
• Real-world usage in hospital design/layout
– Ontology alignment with multiple competing similarity
metrics is another possible formulation (many, many
more)
• QA problem proposed in 1968 finally solved in
2000 with a world-wide network of hundreds of
computers after 7 days of CPU time
– 98.7% solution estimated with Cross-Entropy
prototype on my laptop in 86 seconds
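A minimal sketch of the cross-entropy method on a tiny quadratic assignment instance may make the "stop at any time" property concrete. This is not the prototype mentioned above. The function names, the parameter values (`samples`, `elite`, `alpha`), and the 4x4 flow and distance matrices are all assumptions made for illustration. Each round samples assignments from a probability matrix, keeps the cheapest few, and nudges the probabilities toward them.

```python
import random

def qap_cost(perm, flow, dist):
    """Cost of placing facility i at location perm[i]."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))

def sample_assignment(prob, rng):
    """Draw a permutation: facility i picks a free location j with weight prob[i][j]."""
    free = list(range(len(prob)))
    perm = []
    for row in prob:
        loc = rng.choices(free, weights=[row[j] for j in free], k=1)[0]
        perm.append(loc)
        free.remove(loc)
    return perm

def cross_entropy_qap(flow, dist, samples=60, elite=10, iters=30,
                      alpha=0.7, seed=0):
    rng = random.Random(seed)
    n = len(flow)
    prob = [[1.0 / n] * n for _ in range(n)]    # start from a uniform distribution
    best_cost, best_perm = float("inf"), None
    for _ in range(iters):
        batch = [sample_assignment(prob, rng) for _ in range(samples)]
        batch.sort(key=lambda p: qap_cost(p, flow, dist))
        top = batch[:elite]
        cost = qap_cost(top[0], flow, dist)
        if cost < best_cost:                    # best-so-far is usable at any time
            best_cost, best_perm = cost, top[0]
        # Smoothed update toward the assignment frequencies of the elite samples.
        for i in range(n):
            for j in range(n):
                freq = sum(1 for p in top if p[i] == j) / elite
                prob[i][j] = alpha * freq + (1 - alpha) * prob[i][j]
    return best_perm, best_cost

# Hypothetical 4-facility instance.
flow = [[0, 5, 2, 0], [5, 0, 3, 1], [2, 3, 0, 4], [0, 1, 4, 0]]
dist = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
perm, cost = cross_entropy_qap(flow, dist)
```

The smoothing factor `alpha` keeps every probability above zero, so no location is ruled out permanently; the result is an approximation and is not guaranteed to be optimal.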
Other tools and why they matter
• OpenNLP
– An open-source natural language processing tool that enables
• Detection of named entities (i.e., persons, places, organizations)
• Pronoun reference resolution
• Parsing
– Having rudimentary NLP capabilities would allow us to
rapidly build structured data sets from text corpora, which
we need to test exploitation components.
– People want to see our algorithms operate on real data,
and the harsh reality is that there isn’t an abundance of
relevant RDF data out there to work with
Other tools and why they matter
• Mallet and GRMM (Graphical Models framework)
– An open-source (UMass) graphical models package that
allows us to stand up sizable Bayes nets and Markov nets
and do approximate inference
– These graph models play a central role in Knowledge
Representation (KR) and are helpful especially when it
comes to dealing with fuzziness and noise in the data
– Further, there is great interest in fusing graph models with
a subset of first-order logic. The field that deals with such
fusion is called Statistical Relational Learning. To
explore this field in any detail, it is essential to have a good
graphical models package.
Future Possibilities
• Other Application Domains
– Anomaly detection without a priori training datasets
– Large scale process compliance metrics
– Network attack detection
– Heuristic resource optimization (e.g. for ISR missions)
• Sample Research Topics
– Learning Processes from data
– Automated extraction of indicators from unstructured sources
(intel reports, open sources, imagery, MASINT, etc.)
– Highly-distributed Graph Analysis (using cloud architecture)
– Learning ontologies (relationships) from raw data
Summary
• Traditional strength in Semantic Web technology
– Data modeling
– Data integration
– Reasoning
• Expanding in Information Exploitation
– Semantic graph analysis, similarity, clustering
– Process matching
– Optimization
– … with focus on very fast approximations with large
scalability potential
Conclusion
• BBN offers a wealth of opportunities in Knowledge Engineering and Data Science, with applications relevant to many customer domains
• Our experience, versatility and success make us a valuable resource and an exciting research partner
Over 15 years of experience researching, developing & deploying Semantic Web
applications to DoD customers – a solid track record.
• We actively seek new members for our research teams:
– Arlington: Information Exploitation, Sensor Systems
– Columbia: Speech & Language, Cybersecurity
– Visit: careers.bbn.com