
Source: worldcomp-proceedings.com/proc/p2013/ICA2157.pdf


Association Discovery Framework in WebTAS

M. Heidi McClure¹, Larry Rose¹, Roger J. Dziegiel, Jr²

¹ Intelligent Software Solutions, Inc, Colorado Springs, CO 80919
² Air Force Research Laboratory (AFRL), Rome, NY 13441

Abstract - Association Discovery Framework (ADF) allows rules engines to be integrated with WebTAS so that the rules engines may associate related records. The associations, or relationships as they are known in WebTAS, include system generated confidence values and also allow for user specified confidences. Filtering may be applied to views in WebTAS so that only records with a user confidence are honored, so that only confidences above a threshold are displayed, and so that a combination of user and system confidences is used. This paper describes one implementation of this framework that uses the Seer complex event processing system as its rules engine.

Keywords: WebTAS, Seer, complex event processing, association discovery

1. Introduction

WebTAS, the Web-Enabled Temporal Analysis System, is software that provides data access, query, visualization, analysis and reporting capabilities to government customers. The Association Discovery Framework (ADF) provides WebTAS with an optional server component that searches for related records based on customizable rules or complex event processing (CEP) implementations. ADF is able to examine data, searching for results that match defined patterns. Once discovered, these results are captured by ADF with a relationship or association being created between related records. The relationship between the records is given a confidence or score between 0 and 100% that identifies the system’s confidence in the existence of the discovered relationship. Since analysts do not generally trust everything computers tell them, ADF also provides the ability for users to add their own confidences to any relationship.

The rest of this paper will provide some background about WebTAS, about its ability to capture and visualize relationships between records, about its ability to capture confidences both in relationships and in records, about the Seer complex event processing system that is a component of WebTAS and about the named entity extraction capability that is also a component of WebTAS. Next, this paper will discuss the ADF design including server components, its configuration, its pluggable nature and its visualization enhancements for confidence filtering. Then this paper will provide discussion of a practical example that has been demonstrated using the ADF and Seer as a reference implementation of a rules engine. Finally, some conclusions and ideas for future work will be discussed.

2. Background

2.1 WebTAS

Web-Enabled Temporal Analysis System (WebTAS) is described in [1]. WebTAS allows visualization and analysis of data that may come from many disparate sources. The sources may be, for example, relational databases, live streams of data or text data from files on a file system. In addition to being able to map in and visualize data from external sources, WebTAS also contains a native database which may be based on many standard database solutions - Oracle, Sybase, SQL Server, PostgreSQL, Access, etc. The WebTAS native database provides a place for users to create additional supporting database tables and to collect any other information of interest to them. In some installations, only the native database is used, but usually a mixture of native and external sources is viewed and analyzed in WebTAS.

Once the mapping of data sources is complete, users may visualize data in tables, timelines, graphs, charts, maps or link analysis charts. WebTAS includes a rich access control capability that is role based and may be tied into existing LDAP environments. The controls may grant or deny users’ access to types of data and may grant or deny users’ ability to run parts of the application.

In addition to basic visualization of data, WebTAS also supports a plug-in architecture and is delivered with additional optional features like text categorization (also known as text classification), text clustering, named entity extraction (discussed in Section 2.5) and Seer (discussed in Section 2.4).

2.2 Relationships

Any record viewable in WebTAS may have a relationship generated between that record and any other record. This means that any two records, including records that reside in different databases, may be related using WebTAS’s relationship functionality. The relationship records are stored inside WebTAS’s native database and contain enough reference information to link to the associated records - that is, the relationship records contain primary key information along with the data source and table of the related records. WebTAS also supports data driven or derived relationships, which are similar to database joins where relationship definitions describe how to relate records based on existing data - for example, if two tables contain automobile make, model and year, relationships may be defined that will associate matching records in the different tables where the make, model and year of the vehicles match.
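As an illustration of the two ideas in this section, the following minimal Python sketch models a relationship record that stores the data source, table and primary key of each related record, and derives relationships with a join-like match on make, model and year. The class names, field names and the `derive_relationships` helper are hypothetical and chosen only for this example; they are not WebTAS APIs.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class RecordRef:
    """Enough information to locate a record in any mapped source (hypothetical)."""
    data_source: str
    table: str
    primary_key: str


@dataclass(frozen=True)
class Relationship:
    """A stored relationship between two records (hypothetical layout)."""
    kind: str
    left: RecordRef
    right: RecordRef


def derive_relationships(
    left_rows: Iterable[dict],
    right_rows: Iterable[dict],
    left_source: Tuple[str, str],
    right_source: Tuple[str, str],
    keys: Tuple[str, ...],
    kind: str,
) -> List[Relationship]:
    """Relate rows whose values agree on every key, like a database equi-join."""
    # Index the right-hand rows by their key values for constant-time lookup.
    index: Dict[tuple, List[dict]] = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    found = []
    for row in left_rows:
        for match in index.get(tuple(row[k] for k in keys), []):
            found.append(
                Relationship(
                    kind,
                    RecordRef(*left_source, str(row["id"])),
                    RecordRef(*right_source, str(match["id"])),
                )
            )
    return found


# Made-up sample data: two tables that both carry make, model and year.
registrations = [
    {"id": 1, "make": "Ford", "model": "Focus", "year": 2009},
    {"id": 2, "make": "Honda", "model": "Civic", "year": 2011},
]
sightings = [
    {"id": "s-7", "make": "Ford", "model": "Focus", "year": 2009},
    {"id": "s-8", "make": "Ford", "model": "Focus", "year": 2010},
]
links = derive_relationships(
    registrations, sightings,
    ("native", "registrations"), ("external_db", "sightings"),
    ("make", "model", "year"), "Same Vehicle",
)
```

Only the 2009 Ford Focus rows agree on all three keys, so a single relationship is derived between a native record and an external one.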

When relationships are available, data may be visualized in link analysis charts. These are charts where nodes represent the records and the lines or links between the nodes represent the relationships. Link analysis charts help in visualizing how data are related and are the primary visualization tool for network analysis.

2.3 Confidences in WebTAS

A confidence is a number between 0% and 100% representing the system’s or user’s impression of how correct or accurate the data is - their belief in the existence of the data. In WebTAS, these confidences may be added to any record that WebTAS can see. Since relationships stored in the WebTAS native database are just another type of record, they, too, may have confidence values. In other words, both records and the relationships between them may have confidence values associated with them. There are two types of confidences, System and User: ADF creates the system confidences, and users of WebTAS add or modify the user confidences. Confidences may be applied to both native and external records.

2.4 Seer

Seer is described in [4]. Seer is a fuzzy complex event processing system, which means that it can describe and search for patterns in data that may include patterns in temporal and geo-spatial data in addition to patterns in its other data (non-spatial, non-temporal). Being fuzzy means that Seer does not need to perform exact matches when comparing data in time and space. Seer helps intelligence analysts make sense of complex data sets and, when used with WebTAS, the data sets may include data from many disparate sources. Seer supports prediction using Bayesian reasoning.

Seer is able to represent models that use more complex probabilities or ones that use fairly simplistic confidence increments or factors when reasoning on event states. That is, the assessment strategy used may be probability based, confidence increment based or confidence factor based. Note that Seer confidences are different from the confidences in WebTAS described earlier. Confidences in Seer are a part of the model describing the pattern of interest. Each event state may have a confidence or probability set up to help with Seer’s calculation of a total confidence for a pass through the model, which is also known as an assessment. The assessment confidence will be used by ADF when Seer is configured to be a processing engine, but the Seer confidences and WebTAS confidences are different entities.

Probability based reasoning uses Bayesian algorithms for calculating probabilities, with each state transition having a probability based on the success (true state) of the preceding state [2]. Confidence increment reasoning sums each event state’s confidence and presents a confidence number for the overall success of the model. In this case, negative confidence increments are also allowed and basic logic is used when there are alternative paths through the model (ORs) or collaborative paths (ANDs). The logic for confidence increments is to simply add the confidences along each branch; when there are ANDs, the lowest confidence sum along a branch is used and when there are ORs, the highest is used. For confidence factors, a slightly different logic is used when there are missing states. The equation for confidence factors is PrevConfT + CurConfS − (PrevConfT ∗ CurConfS), where PrevConfT is the total previous confidence and CurConfS is the current state’s confidence.
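The confidence increment and confidence factor strategies described above can be sketched in a few lines of Python. This is a minimal illustration rather than Seer's implementation: the function names are hypothetical, confidences are expressed as fractions between 0 and 1 instead of percentages, and the special handling of missing states mentioned for confidence factors is not modeled.

```python
from functools import reduce
from typing import List, Sequence


def increment_total(path: Sequence[float]) -> float:
    """Confidence increments: add the (possibly negative) increments on a path."""
    return sum(path)


def and_branches(branches: Sequence[Sequence[float]]) -> float:
    """Collaborative paths (ANDs): use the lowest branch sum."""
    return min(increment_total(branch) for branch in branches)


def or_branches(branches: Sequence[Sequence[float]]) -> float:
    """Alternative paths (ORs): use the highest branch sum."""
    return max(increment_total(branch) for branch in branches)


def combine_factor(prev_conf_t: float, cur_conf_s: float) -> float:
    """Confidence factors: PrevConfT + CurConfS - (PrevConfT * CurConfS)."""
    return prev_conf_t + cur_conf_s - prev_conf_t * cur_conf_s


def factor_total(state_confidences: Sequence[float]) -> float:
    """Fold the confidence factor equation over the states, starting from 0."""
    return reduce(combine_factor, state_confidences, 0.0)


# Made-up example values for two branches of a model.
branches: List[List[float]] = [[0.2, 0.3], [0.1, 0.1]]
lowest = and_branches(branches)
highest = or_branches(branches)
two_states = factor_total([0.5, 0.5])
```

With confidence factors, two states at 0.5 each give 0.5 + 0.5 - 0.25 = 0.75, so the total grows with each satisfied state but never exceeds 1 as long as every input lies between 0 and 1.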

In addition to Seer computing a confidence or probability number for the current state of a model, alerts may also be generated from Seer so that timely and perhaps preemptive actions may be taken when confidence or probability has passed a threshold number.

2.5 Named Entity Extraction

WebTAS contains an optional plug-in for performing named entity extraction on text fields in databases or on text documents. Examples of named entity types are people, organizations, dates and locations. Once extracted, the named entities are stored in an “Entity Sources” table with links back to the documents or records that contained the named entity. The named entity extraction plug-in supports various entity extraction implementations including one based on GATE [3], one based on SRI’s PAL Semantic Extraction¹ and one on Janya’s Semantex².

Named entity extraction is just one way to make more sense out of free-text data. By having the extraction find people and organization names, for example, analysts are able to reduce the number of documents they must read as they do their analysis. The ADF example described in Section 4 uses the results of named entity extraction performed on messages.

¹ https://pal.sri.com/Plone/framework/Components/learning-methods/semanticextraction
² Janya is no longer in business

3. Design

As noted earlier, ADF adds to WebTAS the ability to discover relationships between records and provides visualization enhancements that assist the user in managing the discovered relationships. The general architecture of ADF is as shown in Figure 1. The server components include processing and learning engines which adhere to interfaces that the ADF mediator understands. The role of the ADF processing and learning engines is to discover the related records of interest. The ADF mediator takes information from the processing engines and creates WebTAS relationship objects based on mediator configuration. The client enhancements to support ADF include filtering of visualized results based on WebTAS confidence levels - either user-only confidences or a combination of user and system generated confidences. Intelligence analysts are skeptical of computer discovered relationships, so it was very important to provide a way to filter relationships that humans had not yet reviewed and validated.

Fig. 1: Association Discovery Framework

The rest of this section discusses some of these components in more detail and also discusses the configuration and pluggable capability of ADF.

3.1 ADF Mediator

The ADF mediator is a server component that monitors model results and creates relationships appropriately (see Section 3.3). In our initial implementation, a model is directly related to a Seer model. The mediator is also able to examine relationships that have already been created and will update them based on current model execution results. In addition to creating the relationships, additional information is created by the mediator that captures the pedigree of the relationship. This includes the name of the processing or learning engine’s model that created the relationship and the explanation of the relationship.

The ADF mediator runs as a deployed application in a JBoss application server³. The mediator is configured using Spring (described in Section 3.3) and is intended to connect to one or more processing or learning engines. Seer is an example of a processing engine.

Fig. 2: Seer Model

3.2 Models

The ADF processing and learning engines are assumed to use a model (like a Seer model) that will present to the ADF mediator suggestions for relationship creation. Each model is responsible for one type of relationship between two record types, for example, between people and message traffic. Multiple models may create the same type of relationship between the same types of records. Many models may be running that create a variety of relationships between records.

3.3 Configuration and Being Pluggable

For relationship configuration, at start up, the ADF server reads a configuration file which specifies models, record types and the relationships to create between the records. The ADF uses the Spring Framework⁴ and the relationship configuration is based on constructor type injection in Spring. Additionally, the Spring Framework is used to allow a pluggable framework so that various processing or learning engines may be used. In our current implementation, the Seer complex event processing environment is plugged into the ADF framework using the Spring configuration. Other supporting pieces of the ADF mediator are defined and configured in Spring.

³ http://www.jboss.org
⁴ http://www.springsource.org/

The ADF allows for additional rules or processing engines to be placed into the system. The idea is that Seer is an initial implementation of a processing or learning engine. Other engines based perhaps on JBoss BRMS (previously Drools)⁵ or Jess⁶ may be plugged into the ADF to provide alternative means of identifying related objects. To plug in additional processing or learning engines, each engine will need to provide a small amount of code that implements the interfaces that the mediator understands. Then the system is configured to know about the new engine.
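The paper does not list the interfaces that the mediator understands, so the following Python sketch only illustrates the general shape such a plug-in contract could take. Every name here (`Suggestion`, `ProcessingEngine`, `ExactNameEngine`, `EngineRegistry`) is hypothetical, and a simple in-memory registry stands in for the Spring configuration used by the real system.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class Suggestion:
    """One proposed relationship, with the pedigree the mediator records."""
    model_name: str
    left_type: str
    left_id: str
    right_type: str
    right_id: str
    confidence: float  # percent, 0 to 100
    explanation: str


class ProcessingEngine(ABC):
    """Hypothetical contract that every pluggable engine would implement."""

    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def suggestions(self) -> Iterator[Suggestion]:
        ...


class ExactNameEngine(ProcessingEngine):
    """Toy engine: suggests a link when a person's name appears in a message."""

    def __init__(self, people: Dict[str, str], messages: Dict[str, str]):
        self._people = people
        self._messages = messages

    def name(self) -> str:
        return "exact-name"

    def suggestions(self) -> Iterator[Suggestion]:
        for person_id, person_name in self._people.items():
            for message_id, text in self._messages.items():
                if person_name in text:
                    yield Suggestion(
                        self.name(), "Identity of Interest", person_id,
                        "Message Traffic", message_id, 80.0,
                        f"'{person_name}' appears in message {message_id}",
                    )


class EngineRegistry:
    """Stand-in for configuration that tells the system about each engine."""

    def __init__(self) -> None:
        self._engines: Dict[str, ProcessingEngine] = {}

    def register(self, engine: ProcessingEngine) -> None:
        self._engines[engine.name()] = engine

    def collect(self) -> List[Suggestion]:
        return [s for e in self._engines.values() for s in e.suggestions()]


registry = EngineRegistry()
registry.register(ExactNameEngine(
    {"ioi-1": "John Doe"},
    {"msg-10": "Meeting with John Doe at noon", "msg-11": "No names here"},
))
collected = registry.collect()
```

Under this design, adding another rules engine means writing one more `ProcessingEngine` subclass and registering it; the code that consumes suggestions does not change.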

⁵ http://www.redhat.com/products/jbossenterprisemiddleware/business-rules/
⁶ http://herzberg.ca.sandia.gov/

3.4 Client Enhancements to Support Confidence Filtering

WebTAS has been enhanced so that it provides better ways to filter based on confidences and better viewing of related object details. Figure 3 shows a details pane that displays the details of relationships, including both analyst and system (calculated) confidences. Figure 4 shows the display after confidence filtering has been applied to show only analyst verified confidences over 50%. Notice that in Figure 4 some “Mention” relationships are missing. These are relationships that the analyst has not yet verified and that have been filtered out.

Fig. 3: Confidence Filtering and Details Pane

Fig. 4: Confidences Filtered
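The filtering behavior can be sketched as follows. This is a minimal Python illustration with hypothetical names (`Link`, `effective_confidence`, `filter_links`). The paper does not say how user and system confidences are combined, so the sketch assumes that a user confidence, when present, takes precedence over the system confidence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Link:
    label: str
    system_confidence: Optional[float]  # percent, set by ADF
    user_confidence: Optional[float]    # percent, set by an analyst, if any


def effective_confidence(link: Link, mode: str) -> Optional[float]:
    """Pick the confidence that a filter mode looks at."""
    if mode == "user_only":
        return link.user_confidence
    if mode == "combined":
        # Assumption: an analyst's value overrides the system value.
        if link.user_confidence is not None:
            return link.user_confidence
        return link.system_confidence
    raise ValueError(f"unknown filter mode: {mode}")


def filter_links(links: List[Link], mode: str, threshold: float) -> List[Link]:
    """Keep only links whose effective confidence is above the threshold."""
    kept = []
    for link in links:
        value = effective_confidence(link, mode)
        if value is not None and value > threshold:
            kept.append(link)
    return kept


# Made-up links: one unreviewed, one confirmed, one doubted by an analyst.
links = [
    Link("Mention A", 80.0, None),
    Link("Mention B", 60.0, 90.0),
    Link("Mention C", 80.0, 30.0),
]
verified = filter_links(links, "user_only", 50.0)
combined = filter_links(links, "combined", 50.0)
```

In the user-only mode the unreviewed "Mention A" link is dropped, which corresponds to the missing links in Figure 4; in the combined mode it is kept on the strength of its system confidence.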

4. Application with Extracted Entities

This section describes an implementation of the ADF framework that uses Seer as a processing engine. The goal of this implementation is to create relationships between Identities of Interest (people) and Message Traffic records that mention those people. In this case, the Identities of Interest have a primary name and a list of aliases. The models will be configured to place more confidence on a primary name match than on an alias match. Models will look at all known names when trying to find matches in the Message Traffic.

With this ADF implementation, analysts can be readily directed to the Message Traffic that mentions the people they are looking for. ADF does not replace the job of reading the messages; it only helps to point the analyst to documents that may be of most interest to their current focus.

4.1 Seer Model for Person to Message Traffic

This section will describe what Seer models look like using an example that will identify messages (Message Traffic) that mention people’s names based on previously performed named entity extraction.

The model first looks for the names of the people we are interested in finding in the extracted information - these people are stored in an Identities of Interest table in the system. Second, the model looks for those names in the named entity extraction records. Third, as the last step, the model looks for the actual message that mentions the person.

In Figure 2, the Seer model builder is displayed. The boxes and arrows represent the three steps just described. The first state, labeled IoI, does a query “Identity of Interest Observed”. The second state does a more complex query “Entity Sources Observed Where Entity Type Equals PERSON And Source Object.ClassName Equals Message Traffic And Entity Hits.Entity Value Equals $(IoI.Display Name)”. This is a query of the Entity Sources records. Entity Sources are the named entity extraction results. This query is constrained to only look for PERSON named entity types that are referring to Message Traffic records and that have the named entity value that matches the Identity of Interest queried in the first state. The $(IoI.Display Name) syntax means that the query, at run time, will refer to the results of the IoI event state (the first state) and will use the Display Name attribute of the Identity of Interest. The third state uses the “Source Object ID” information from the “EntitySrc” state (second state) in order to find the actual Message Traffic record that mentioned the Identity of Interest’s name. The third query looks like this: “Message Traffic Observed Where Object Id Equals $(EntitySrc.Source Object.ID)”.

This model uses the Display Name of the identity. Another model we use looks at the list of aliases for the identity, following similar logic to identify any messages that mention the aliases of the identity.
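To make the three chained states concrete, the sketch below mimics them with plain Python lookups over in-memory tables. It is not Seer query syntax and does not use any Seer API; the dictionary keys loosely follow the attribute names quoted above, and the sample records and the `run_model` function are invented for illustration.

```python
from typing import Dict, List, Tuple

# Made-up sample tables standing in for the three record types.
identities = [
    {"Object Id": "ioi-1", "Display Name": "John Doe"},
]
entity_sources = [
    {"Entity Type": "PERSON", "Entity Value": "John Doe",
     "Source Class": "Message Traffic", "Source Object ID": "msg-10"},
    {"Entity Type": "ORGANIZATION", "Entity Value": "John Doe",
     "Source Class": "Message Traffic", "Source Object ID": "msg-11"},
    {"Entity Type": "PERSON", "Entity Value": "John Doe",
     "Source Class": "Report", "Source Object ID": "rep-3"},
]
messages = [
    {"Object Id": "msg-10", "Text": "Meeting with John Doe at noon"},
    {"Object Id": "msg-11", "Text": "John Doe Ltd shipped the goods"},
]


def run_model(
    identities: List[dict], entity_sources: List[dict], messages: List[dict]
) -> List[Tuple[str, str]]:
    """Return (identity id, message id) pairs found by the three states."""
    messages_by_id: Dict[str, dict] = {m["Object Id"]: m for m in messages}
    matches = []
    for ioi in identities:  # state 1: Identity of Interest observed
        for src in entity_sources:  # state 2: matching Entity Sources record
            if (
                src["Entity Type"] == "PERSON"
                and src["Source Class"] == "Message Traffic"
                and src["Entity Value"] == ioi["Display Name"]
            ):
                # state 3: the Message Traffic record named by the source
                message = messages_by_id.get(src["Source Object ID"])
                if message is not None:
                    matches.append((ioi["Object Id"], message["Object Id"]))
    return matches


pairs = run_model(identities, entity_sources, messages)
```

Only the first Entity Sources record satisfies all three constraints of the second state, so the model yields a single identity-to-message pair. An alias variant would compare the entity value against each alias instead of the Display Name.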

The assessment strategy used in this Seer model is the confidence increment strategy. Here, each state may be given a confidence, and these are summed to come up with the confidence in the instance of model execution. For example, if we have an exact name match, we may place an 80% confidence on the last state. If we are running the model based on aliases, we may place a smaller confidence, say 60%, on the last state. Since all states in the model must be met for success, the first and second states may be left at 0% confidence.

4.2 Configuration

To create the correct “Mentions” relationship, the mediator is configured to look for Seer-based ADF results where there are records of type “Identity of Interest” and of type “Message Traffic” and, when they are found, it will create a “Mentions” relationship between them. As part of the relationship creation, the mediator will also get the system confidence identified by the Seer-based ADF processor. That is, as described in Section 4.1, 80% if it is a primary display name match and 60% if it is an alias match.
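The mediator behavior described here and in Section 3.1 can be sketched in Python as follows. The real ADF reads this mapping from a Spring configuration file; the dictionary, class and method names below are hypothetical stand-ins. The rule that a later result overwrites the stored system confidence is an assumption, since the paper only says that existing relationships are updated.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for the configuration file:
# (left record type, right record type) -> relationship to create.
RELATIONSHIP_CONFIG: Dict[Tuple[str, str], str] = {
    ("Identity of Interest", "Message Traffic"): "Mentions",
}


@dataclass
class StoredRelationship:
    kind: str
    left_id: str
    right_id: str
    system_confidence: float  # percent
    model_name: str           # pedigree: which model produced it
    explanation: str          # pedigree: why the records are related


class Mediator:
    def __init__(self, config: Dict[Tuple[str, str], str]) -> None:
        self._config = config
        self.relationships: Dict[Tuple[str, str, str], StoredRelationship] = {}

    def handle_result(
        self, left_type: str, left_id: str, right_type: str, right_id: str,
        confidence: float, model_name: str, explanation: str,
    ) -> Optional[StoredRelationship]:
        """Create or update the configured relationship for one model result."""
        kind = self._config.get((left_type, right_type))
        if kind is None:
            return None  # no relationship is configured for this pair of types
        key = (kind, left_id, right_id)
        existing = self.relationships.get(key)
        if existing is None:
            self.relationships[key] = StoredRelationship(
                kind, left_id, right_id, confidence, model_name, explanation
            )
        else:
            # Assumption: the latest model execution result replaces the old one.
            existing.system_confidence = confidence
            existing.model_name = model_name
            existing.explanation = explanation
        return self.relationships[key]


mediator = Mediator(RELATIONSHIP_CONFIG)
mediator.handle_result(
    "Identity of Interest", "ioi-1", "Message Traffic", "msg-10",
    60.0, "alias-model", "alias match",
)
updated = mediator.handle_result(
    "Identity of Interest", "ioi-1", "Message Traffic", "msg-10",
    80.0, "display-name-model", "primary display name match",
)
ignored = mediator.handle_result(
    "Organization", "org-1", "Message Traffic", "msg-10",
    80.0, "display-name-model", "not configured",
)
```

After these calls the mediator holds one "Mentions" relationship whose system confidence is 80%, and the result for the unconfigured record type pair is ignored.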

4.3 Visualization

When an analyst sits down at their WebTAS station, they are able to see the system generated relationships in a couple of places. The first is in the link analysis charts. Refer to Figures 3 and 4. The Mention relationships (lines between nodes) have all been generated using ADF. When a link is highlighted as shown in Figure 3, the details of this particular relationship are visible in the details pane on the right of this display. The user may right click on the Analyst Confidence and change their assessment of the relationship. As shown in the right side display, the analyst may also choose to show only those records someone has reviewed and validated - that is, only show records that have an Analyst Confidence assigned. When they filter this way, some of the Mention links between nodes disappear.

When analysts drill down into the details of a record, they will have a related objects tab as shown in Figure 5. This figure shows an example of an organization record (“Southern Bloc”) with its related objects tab highlighted. In this view, users can examine the related object (in this case, “In Sight Crime”) and expand the type of relationship (“Mention”) to see more information about the pedigree of this relationship. This is the same relationship highlighted in the details pane view in Figure 3.

5. Conclusion

We have presented our current work that provides an association discovery framework (ADF) as part of WebTAS. ADF assists users by discovering relationships between records in the user’s domain of data. The computer generated relationships are given a system confidence so that the user may distinguish between relationships that the ADF created and ones that users have vetted. Because the ADF is pluggable, this framework provides an environment where other processing, rules or learning systems may be integrated and analyzed.

6. Future

Now that the basic framework is in place, we would like to try other rules engines like Jess or JBoss BRMS and perhaps test out other learning algorithms or association mining algorithms.

Acknowledgments

We would like to thank Michael Shai who helped to make our graphics look better.

Work described in this paper was funded by Air Force Research Laboratory (AFRL) at Rome, NY (AFRL contract FA8750-09-D-0022) and the project was called Smart Target Folders.

The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

References

[1] WebTAS overview, 2013. Intelligent Software Solutions - http://www.issinc.com/programs/webtas.html.

[2] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

[3] Hamish Cunningham, Diana Maynard, Kalina Bontcheva, and Valentin Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), 2002.

[4] M. Gerken, R. Pavlik, C. Houghton, K. Daly, and L. Jesse. Situation awareness using heterogeneous models. In Collaborative Technologies and Systems (CTS), 2010 International Symposium on, pages 563-572, May 2010.


Fig. 5: Related Objects Tab