Automated Provenance Capture Architecture and Implementation Karen Schuchardt, Eric Stephan, Tara Gibson, George Chin PNNL


Page 1: Automated Provenance Capture Architecture and Implementation

Automated Provenance Capture Architecture and Implementation

Karen Schuchardt, Eric Stephan, Tara Gibson, George Chin

PNNL

Page 2: Automated Provenance Capture Architecture and Implementation

Methodology

• Simulated using the Kepler workflow system. We did not attempt to leverage looping.

• Programmed stub actors for each step with proper inputs, outputs, and user-controlled parameters.

• Implemented an execution event listener that optionally records the workflow. No changes to core Kepler were made.

• Applied/extended our existing content management/provenance system to see how far we could go with it.

• Implemented actors/workflows for queries/visual analysis using XSLT/GraphViz.

Page 3: Automated Provenance Capture Architecture and Implementation

Provenance Capture Architecture

[Architecture diagram] Workflow tools (Workflow Engine, Workflow UI) send events to a Prov Capture Service. The Provenance/Content System comprises a Naming Service, Content Store, Metadata Extraction, Indexing, Triple Store, Query Service, and a Prov Service with a Translation layer. Client Tools include Analysis Tools, Query Tools, Browsing Tools, Annotation Tools, and Harvesting Tools.

Page 4: Automated Provenance Capture Architecture and Implementation

SDG Provenance Capture Implementation

[Implementation diagram] Workflow tools (the Kepler Engine, Kepler UI, and Kepler workflows, all within Kepler) send events to the Prov Capture Service. Naming uses a URL/LSID service. Storage combines a Content Store and a Triple Store, with metadata extraction via Defuddle, indexing via Lucene, Triple Harvesters, and SEDASL Query and Prov processors; SAM supplies the Translation layer. Interfaces are URIQA (RDF)/WebDAV and GML/WebDAV. Client tools: DavExplorer/Ecce, Nettool, Node x Node Comparison, and the Batagelj/Mrvar algorithm.

Page 5: Automated Provenance Capture Architecture and Implementation

Physical Model

[Model diagram] A Named Thing has 0..n Properties and associated Content; a Property can be a link or a value.

Any “thing”, for which we want to capture some information, is given a unique id with which properties and relationships can be associated. Additionally, content can be associated with these “things”.
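This physical model is simple enough to sketch directly. A minimal illustration in Python (class and method names are invented for illustration, not taken from the actual system):

```python
import uuid

class NamedThing:
    """Any 'thing' we want to capture information about.

    Each thing gets a unique id; properties (links or values) and
    optional content can then be associated with that id.
    """
    def __init__(self):
        self.uid = str(uuid.uuid4())   # unique id for this thing
        self.properties = []           # 0..n (name, value-or-link) pairs
        self.content = None            # optionally associated content

    def add_property(self, name, value):
        # A property can be a link (another thing's uid) or a plain value.
        self.properties.append((name, value))

# Example: model a workflow output file
result = NamedThing()
result.add_property("title", "atlas-x.gif")
result.add_property("format", "image/gif")
```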

Page 6: Automated Provenance Capture Architecture and Implementation

Logical Overlay

[Model diagram] Four entity types, each carrying a uid and allowing [arbitrary triple]* extensions:

• Workflow Instance — startedExecution, finishedExecution, hasStatus, creator, wasRunBy, owningInstitution, created, createdWith, title

• Actor Instance (0..n per Workflow Instance, via isPartOf) — startedExecution, finishedExecution, title

• Port Value (linked to an Actor Instance by hasSource, hasOutput, or isInput) — title, format, created, and hasValue OR hasHashOfValue

• Parameter (0..n per Actor Instance, via hasParameter) — title, format, and hasValue OR hasHashOfValue
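The overlay can be read as plain triples over uids. A hypothetical capture of one actor run, using the property names from the overlay (the uids and values are invented):

```python
# Hypothetical provenance triples for one actor execution.
triples = [
    ("wf:run1",  "wasRunBy",       "user:alice"),
    ("wf:run1",  "hasStatus",      "finished"),
    ("actor:a4", "isPartOf",       "wf:run1"),
    ("actor:a4", "title",          "AlignWarp4"),
    ("port:v1",  "hasSource",      "actor:a4"),
    ("port:v1",  "format",         "application/octet-stream"),
    ("port:v1",  "hasHashOfValue", "c0ffee"),  # hash captured instead of the value
]

# Which port values came out of actors belonging to run1?
run1_actors = {s for (s, p, o) in triples if p == "isPartOf" and o == "wf:run1"}
run1_values = [s for (s, p, o) in triples if p == "hasSource" and o in run1_actors]
```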

Page 7: Automated Provenance Capture Architecture and Implementation

Semantically Extended DASL Queries

Select – all properties or a specific list; result format (GXL, RDF, WebDAV)

Scope – a URL or a query (i.e. 2-phase); names of properties to follow (and direction); stop conditions (property/value comparisons, depth)

Where – property name/value comparisons; content search
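The Select/Scope/Where structure amounts to a bounded graph traversal. A sketch in Python (not the actual SEDASL syntax; the in-memory triple list and property names are assumptions):

```python
def scoped_query(triples, start, follow, max_depth, where=lambda props: True):
    """Sketch of a semantically extended DASL query:
    Scope  = start node + named properties to follow + depth stop condition,
    Where  = a property/value predicate,
    Select = return matching node uids (properties could be fetched next).
    """
    index = {}
    for s, p, o in triples:
        index.setdefault((s, p), []).append(o)

    results, frontier, seen = [], [(start, 0)], {start}
    while frontier:
        node, depth = frontier.pop(0)
        props = {p: o for (s, p, o) in triples if s == node}
        if where(props):
            results.append(node)
        if depth < max_depth:                # stop condition: depth
            for prop in follow:              # only follow named properties
                for nxt in index.get((node, prop), []):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, depth + 1))
    return results

# Example: find gif-format outputs one hop from a run
triples = [
    ("run1", "hasOutput", "img1"),
    ("img1", "format", "image/gif"),
    ("run1", "hasOutput", "img2"),
    ("img2", "format", "text/plain"),
]
gifs = scoped_query(triples, "run1", follow=["hasOutput"], max_depth=1,
                    where=lambda p: p.get("format") == "image/gif")
```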

Page 8: Automated Provenance Capture Architecture and Implementation

Workflow Comparisons

• Node-by-node comparisons
  – Nodes match if all node attributes and incoming and outgoing edges match
  – Nodes are similar if attributes and edges match to some specified XX%

• After node comparisons, edges are compared
  – Edges match if connecting nodes were found to be exactly matching or similar and edge attributes match
  – Edges are similar if attributes match to some specified XX%

• Outputs include:
  – Matching or similar nodes
  – Matching or similar edges
  – Nodes only in first or second graph
  – Edges only in first or second graph

[Example comparison] Node attributes compared: title, instantiationOf, source, value, format. Relations followed: isPartOf (and reverse), isInput (and reverse), hasOutput (and reverse). Sample output — nodes only in first graph: node52 (atlas-z.gif), node14 (imageformat), node53 (convertyimage), node57 (atlas-y.gif), node36 (convertzimage), node34 (imageformat), node15 (imageformat), node78 (atlas-x.gif), node26 (convertximage); count: 9.
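The node comparison rules above can be sketched as follows (a simplification: attribute agreement only, with the edge-matching second pass omitted; the threshold stands in for the "XX%"):

```python
def node_similarity(attrs_a, attrs_b):
    """Fraction of attribute keys on which two nodes agree:
    1.0 means an exact attribute match; a threshold decides 'similar'."""
    keys = set(attrs_a) | set(attrs_b)
    if not keys:
        return 1.0
    agree = sum(1 for k in keys if attrs_a.get(k) == attrs_b.get(k))
    return agree / len(keys)

def compare_graphs(nodes_a, nodes_b, threshold=0.8):
    """Classify nodes as matching, similar, or only-in-one-graph.
    nodes_* map node id -> attribute dict.  Edge comparison, which the
    slides run as a second pass over matched nodes, is omitted here.
    """
    matching, similar = [], []
    only_a, only_b = set(nodes_a), set(nodes_b)
    for na, attrs_a in nodes_a.items():
        for nb, attrs_b in nodes_b.items():
            sim = node_similarity(attrs_a, attrs_b)
            if sim == 1.0:
                matching.append((na, nb))
            elif sim >= threshold:
                similar.append((na, nb))
            else:
                continue
            only_a.discard(na)
            only_b.discard(nb)
    return matching, similar, sorted(only_a), sorted(only_b)

# Example with invented node attributes
graph_a = {"n1": {"title": "atlas-x.gif", "format": "gif"},
           "n2": {"title": "convertximage", "format": ""}}
graph_b = {"m1": {"title": "atlas-x.gif", "format": "gif"},
           "m2": {"title": "other", "format": "png"}}
matching, similar, only_a, only_b = compare_graphs(graph_a, graph_b)
```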

Page 9: Automated Provenance Capture Architecture and Implementation

Workflow Graph Distances

Implements social network algorithm based on triad census (Batagelj and Mrvar, 2001; Chin, Whitney, Powers, and Johnson, 2004)

• Examines every possible combination of three nodes in a workflow graph
• Each three-node combination falls into 1 of 64 possible triad states
• The census is the count of triads in each state, which may be used to summarize or profile overall graph structure
• Distance is computed as the Euclidean distance between two triad censuses, normalized to a 0.0..1.0 value

Most useful for assessing similarity across large, complex workflows. Distance computed for the two workflow graphs: 0.095888

[Figure: two workflow graphs with their triad census vectors, (0, 4, 6, 1, …)]
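The distance step can be sketched as follows, assuming the two 64-state censuses are already computed. The normalization into 0.0..1.0 shown here (unit-scaling each census, then dividing by the maximum possible distance, sqrt(2)) is one plausible choice, not necessarily the one used:

```python
import math

def census_distance(census_a, census_b):
    """Euclidean distance between two 64-state triad censuses,
    normalized into 0.0..1.0.

    Normalization choice (an assumption): scale each census to a unit
    vector first, so distance reflects the *profile* of triad states
    rather than graph size, then divide by sqrt(2), the maximum
    distance between unit vectors with non-negative entries.
    """
    def unit(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v] if n else list(v)

    ua, ub = unit(census_a), unit(census_b)
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(ua, ub)))
    return d / math.sqrt(2.0)
```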

Page 10: Automated Provenance Capture Architecture and Implementation

What’s Cool

– Combined RDF assertions with scientific content management
  • Flexible capabilities for metadata extraction (e.g. Defuddle to extract data from a warp file)
  • Existing RDF harvesters could be plugged in through the same mechanism
  • Extensible translation mechanism (browse tools can provide views of raw data, such as a table of warp parameters)
– Conceptually simple model that can apply to much more than workflow execution
– Readily adaptable to alternative models, constructs, relationships
– Indexing and query of content or metadata
– All relationships are reverse indexed automatically; you can search up or down and even mix directions on specific properties
– Flexible event-based model, so as to minimize connections into the workflow engine
– Actors can contribute their own metadata easily through events
– User control over which actors to capture provenance on
– Automatic content type determination
– Multiple output formats
– Capability to capture hashes instead of values
– Leveraged DASL extension mechanisms
– Based on existing standards (HTTP) – existing tools can be leveraged
– Pluggable authentication model based on JAAS
– Everything is open source
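The automatic reverse indexing of relationships can be sketched as follows (the synthetic "-reverse" naming follows the property names on the comparison slide; the index layout itself is an assumption):

```python
def build_index(triples):
    """Sketch of automatic reverse indexing: every asserted triple
    (s, p, o) also gets a synthetic (o, p + '-reverse', s) entry, so
    queries can walk a relationship in either direction, or mix
    directions per property."""
    index = {}
    for s, p, o in triples:
        index.setdefault((s, p), []).append(o)
        index.setdefault((o, p + "-reverse"), []).append(s)
    return index

triples = [("actor:a1", "isPartOf", "wf:run1"),
           ("port:v1", "hasSource", "actor:a1")]
idx = build_index(triples)

# Walk "down" from the workflow run, then "up" to the port value:
actors = idx[("wf:run1", "isPartOf-reverse")]
values = idx[(actors[0], "hasSource-reverse")]
```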

Page 11: Automated Provenance Capture Architecture and Implementation

Limitations

– Prov capture is slow. We do one assertion at a time currently, but they could all be packaged up into one request.

– RDF predicates can’t contain special characters, but things like parameters often have these characters.

– SAM can be made to work, but the current implementation, based on WebDAV, ties resources to metadata. We had to create dummy resources.

– SAM is not RDF based.

– Big files (reference images) are duplicated as part of provenance tracking because they are data inputs to multiple actors.

– Did not get to the LSID service, but it would be nice if this weren’t a separate protocol to deal with.

Page 12: Automated Provenance Capture Architecture and Implementation

Kepler/Workflow Comments

– Decided to stay with the brute-force model instead of a loop-based model. A loop-based model would probably introduce controller actors that would obscure the provenance capture.

– There is an issue of what to capture provenance on for more general workflows.

– Coding actors for each thing you want to do doesn’t scale and is a barrier to adoption by scientists.

– Can’t control actor firing order, which resulted in things like AlignWarp4 producing warp1.warp.

– We used string constant actors to supply input files, but it makes more sense for Kepler to support the concept of a data source.

– We could not tell if a port value was a file except by using File.exists().

– Would like to see events be external, for complete separation from the workflow engine.

Page 13: Automated Provenance Capture Architecture and Implementation

Out of (Current) Scope

– Dynamically changing and continuing workflows (i.e. evolving workflows)

– Pointing back to provenance on actors. A real system would do this, and the actors themselves would have global ids that could be referenced.

– Capturing provenance on workflow descriptors and pointing back to them (same as above for actors)

– Use of LSIDs – we have the service running but never got to the point of inserting it. Instead we used a URL name-generation service.

– Signing results

Page 14: Automated Provenance Capture Architecture and Implementation

Brainstorming Categorization

• How data was generated
  – User-set parameter values
  – Workflow structure/execution capture
  – Outside tools
  – Auto-generated metadata/content

• The structure of the query
  – 2-phase query (or recursive)
  – Specifying what to include
  – Specifying what to exclude

• What it will be used for
  – Exploratory analysis
  – Directed query to answer a specific question
  – Debugging
  – Verification
  – Comparison