YesWorkflow: More Provenance Mileage from Scientific Workflows
and Scripts!
Bertram Ludäscher Director, Center for Informatics Research in Science and Scholarship (CIRSS)
Professor, Graduate School of Library and Information Science (GSLIS) Faculty affiliate, NCSA & Department of Computer Science
Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• … vs Provenance in Scientific Workflows
• YesWorkflow: Doing more (sometimes with less)
More Provenance Mileage from Workflows and Scripts 2
Provenance Palooza
• Provenance
– … or provenience?
• Chain of custody
• Lineage
• Pedigree
• Genealogy
• Phylogeny
• History
• Origin
Provenance Research everywhere … … and here:
Provenance as we all know it • Oxford English Dictionary:
– coming from some particular source or quarter; origin, derivation
– the history or pedigree of a work of art, manuscript, rare book, etc.
– concretely, a record of the passage of an item through its various owners (“chain of custody”)
• Merriam-Webster:
– prov·e·nance noun \ˈpräv-nən(t)s, ˈprä-və-ˌnän(t)s\
– the origin or source of something
• Origin:
– French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come
Provenance
Is this a real Leonardo? Lack of reliable Provenance casts a doubt on this …
Pedigree
Natural History: Understanding what happened…
Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.
Author: Jkwchui (Based on drawing by Truth-‐seeker2004)
Provenience vs Provenance
Society of American Archivists http://www2.archivists.org/glossary/terms/p/provenance
• Principle of provenance (respect des fonds)
• Keep records of different origins separate to preserve context
Archivists
So what is “provenance” (sensu W3C)?
• Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
• Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
• Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world
Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• Provenance in Scientific Workflows
• YesWorkflow: Doing more (sometimes with less)
Provenance => Transparency
• = “Externally-facing” provenance
– “Them-Provenance”
• Later: “Internally-facing” provenance
– “Me-Provenance”
Climate Change: Whodunnit?
Tracing the sources (data, code)
From “Climate Gate” to Reproducible Science
Data & Provenance Management: Single Model
Data & Provenance Management: Model Chains
Some things people do with “provenance”
• Result validation
• Result debugging (science vs wf logic)
• Reproducibility and Repeatability
• Explanation (derivations, traces, proof trees)
• Runtime monitoring
– Profiling, benchmarking
• Performance Optimization (“smart re-run”)
• Fault-tolerance, crash-recovery
• Database view maintenance (e.g. data warehousing)
• …
Provenance for Virtual Joint Experiments
• How do we ensure that Charlie gets a complete account of the history of WC’s outputs?
• How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob’s data v?
→ traces TA and TB will be critical
→ need to compose them to obtain TC
We can view the composition WC as a new, virtual workflow
[Figure: (1) develop WA, (2) run RA (Alice); (3) develop WB, (4) data sharing, (5) run RB (Bob); (6) inspect provenance, (7) understand, generate WC (Charlie); traces TA, TB]
Open Provenance Model => W3C Prov
W3C Prov: One size fits all?
Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• Provenance in Scientific Workflows
• YesWorkflow: Doing more (sometimes with less)
Types of Data Provenance
• Black-box
– know (next to) nothing at compile-time
– at runtime: keep some data lineage
– most prov sensu WF work use this
• White-box
– statically (compile-time) analyzable
– q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
– most prov sensu DB work use this
• Grey-box
– can “look inside” (some black boxes)
– … e.g. b/c they have subworkflows
– … or FP signatures: A :: t1, t2 → t3, t4
– … or semantic annotations (sem. types)
[Figure: black box f; grey box A with inputs t1, t2 and outputs t3, t4; white box q with inputs X1, X2 and outputs Y1, Y2]
Provenance in Databases
Source: Val Tannen
Abstracting the structure of querying
Source: Val Tannen
In database provenance, tuples are either combined conjunctively (*) or disjunctively (+)
→ That’s the core model!
Provenance Polynomials One Semiring to Rule them all! (DB theory strikes!)
Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007
Unifying most prior work in a simple model!
Example: Go from X to Y in 3 hops! (e.g., a = CS, b = NCSA, c = GSLIS)
• Database: hop(X,Y) := hop(a,a, p). hop(a,b, q). hop(b,a, r). hop(b,c, s).
• Query: 3hop(X,Y) :- hop(X, Z1), hop(Z1, Z2), hop(Z2, Y).
• Answer: 3hop(a,a, p³+2pqr). 3hop(a,b, p²q+q²r). … 3hop(a,c, pqs).
3hop from a: to a: ppp+pqr+qrp; to b: ppq+qrq; to c: pqs
3hop from b: to a: ppr+qrr; to b: rpq; to c: rqs
Note: Cannot go from c to a in 3 hops!
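The 3hop example can be replayed in a few lines of Python. This is only a sketch: the dictionary encoding of hop and the names three_hop and show are illustrative assumptions, not part of the slides. Each derivation multiplies the three edge annotations, and alternative derivations of the same answer are added.

```python
from collections import Counter

# Annotated edge relation from the slide: hop(X, Y) with its annotation.
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}

def three_hop(hop):
    """Map each answer (x, y) to a polynomial {monomial: coefficient}."""
    answers = {}
    for (x, z1), e1 in hop.items():
        for (u, z2), e2 in hop.items():
            if u != z1:
                continue
            for (v, y), e3 in hop.items():
                if v != z2:
                    continue
                # Multiplication is commutative, so sort the variables.
                monomial = "".join(sorted(e1 + e2 + e3))
                answers.setdefault((x, y), Counter())[monomial] += 1
    return answers

def show(poly):
    """Render a polynomial such as Counter({'ppp': 1, 'pqr': 2})."""
    return " + ".join(
        (str(c) if c > 1 else "") + m for m, c in sorted(poly.items())
    )

POLYS = three_hop(HOP)
```

Here show(POLYS[("a", "a")]) renders the polynomial for 3hop(a,a) with ppp standing for p³, and ("c", "a") does not occur as a key at all, which is exactly the missing answer that semiring provenance cannot explain.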
Provenance Polynomials
“My precious!”
[Figure: hierarchy of increasingly coarse provenance for 3hop(a,a): p³ + 2pqr; p³ + pqr and p + 2pqr; p + pqr; pqr; p + pqr; p]
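The coarser expressions on this slide can be obtained from p³ + 2pqr by forgetting information step by step. The sketch below assumes a polynomial is stored as {monomial: coefficient}; the function names are illustrative and not taken from any provenance library.

```python
from collections import Counter

# p^3 + 2pqr, written as {monomial: coefficient} with sorted variables.
NX = Counter({"ppp": 1, "pqr": 2})

def drop_exponents(poly):
    """Forget how often a variable is used: p^3 + 2pqr -> p + 2pqr."""
    out = Counter()
    for mono, coeff in poly.items():
        out["".join(sorted(set(mono)))] += coeff
    return out

def drop_coefficients(poly):
    """Forget how many derivations share a monomial: p^3 + 2pqr -> p^3 + pqr."""
    return Counter({mono: 1 for mono in poly})

def lineage(poly):
    """Keep only the set of contributing variables: p^3 + 2pqr -> pqr."""
    return "".join(sorted({v for mono in poly for v in mono}))
```

Applying both drop_exponents and drop_coefficients gives p + pqr, so the four functions reproduce the values p + 2pqr, p³ + pqr, p + pqr and pqr shown in the figure.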
Provenance in Databases
Negation & Why-Not Provenance
• Provenance Semirings work well for:
– Positive Queries (e.g., RA+)
• Challenges: Handling of
– set difference (~ negation)
– Why-not provenance
– Missing Answer provenance
• A fresh look at provenance!
• … using an old idea: Game semantics!
Provenance (or Query Evaluation) Games
“SLD-resolution game” A(X) :- B(X,Y,Z) … not C(X,Y) …
Eureka! [KLZ13] Köhler, S., Ludäscher, B., & Zinn, D. (2013). First-order provenance games. In Search of Elegance in the Theory and Practice of Computation. Springer
Translation: Q(I) => G_Q(I)
Figure 3: Provenance game for Q_ABC. The well-founded model of win(X) :- M(X,Y), ¬win(Y), applied to move graph M, solves the game.
(a) Game template for Q_ABC: A(X) :- B(X,Y), ¬C(Y).
(b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}.
(c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not).
the new binding for X; a condition “B(X,Y)” means that a move is possible only if B(X,Y) is true in I for the current X, Y values.²
Given database I, a template can be instantiated yielding a game graph G_Q(I) as in Fig. 3b. Note how template variables (e.g., Y) have been replaced by domain values (a or b), and that conditional edges (e.g., labeled “C(X)”) became unconditional edges (e.g., C(a) → r_C(a)) or no edge at all (e.g., from C(b)), depending on whether or not the condition holds in I. To extract why(-not) provenance from a game graph G_Q(I) as in Fig. 3b, we need to solve the game first, i.e., determine which positions are won (light green) or lost (dark red); see Fig. 3c. There is a surprisingly simple and elegant solution: the (unstratified) Datalog¬ rule Q_wm :=
win(X) :- move(X,Y), ¬win(Y)
when evaluated under the well-founded semantics [VGRS91] solves the game! Thus we can use Q_wm as a “game engine” to solve the provenance game with a move relation given by G_Q(I).³
Finally, the solved game is a labeled graph G^γ_Q(I), i.e., each node carries a new label γ, indicating whether a position is won (light green) or lost (dark red). As shown in [KLZ13], only edges from won to lost nodes (green) and lost to won nodes (red) are part of the provenance; other edges (grey, dashed in Fig. 3c) correspond to “bad moves” (invalid arguments in the query evaluation game) and are excluded from the provenance. The provenance subgraph of
² Readers familiar with logic programming semantics may recognize that provenance games mimic a form of SLD(NF) resolution.
³ Indeed, both our prototypes use Q_wm to compute (constraint) provenance.
Figure 4: Altered subgraph of Fig. 3c after adding c to the active domain.
Figure 5: Constraint provenance game for Q_ABC. Unlike in Figure 3, nodes may represent finite or infinite sets here.
G^γ_Q(I) thus consists only of edges that are matched by the regular path queries (g.r)+ and r.(g.r)*, i.e., alternating sequences of green (winning) and red (delaying) moves [KLZ13].
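To make the solving step concrete, here is a small Python sketch that builds a move graph for Q_ABC over I = {B(a, b), B(b, a), C(a)} and labels positions by iterating the win/move rule to a fixpoint. The node encoding (tuples such as ("r2", x, y)) and the exact shape of the move graph are my reading of the game template in Fig. 3a and should be treated as assumptions; the prototypes mentioned in the text evaluate the Datalog rule under the well-founded semantics instead.

```python
from collections import defaultdict

D = ["a", "b"]                 # active domain
B = {("a", "b"), ("b", "a")}   # input relation B
C = {"a"}                      # input relation C

def build_moves():
    """Move graph for r2: A(X) :- B(X,Y), not C(Y) (assumed template shape)."""
    moves = defaultdict(list)
    for x in D:
        for y in D:
            moves[("A", x)].append(("r2", x, y))             # pick a value for Y
            moves[("r2", x, y)] += [("g1", x, y), ("g2", y)]  # attack a subgoal
            moves[("g1", x, y)].append(("notB", x, y))
            moves[("notB", x, y)].append(("B", x, y))
            if (x, y) in B:
                moves[("B", x, y)].append(("rB", x, y))       # fact as evidence
    for y in D:
        moves[("g2", y)].append(("C", y))
        if y in C:
            moves[("C", y)].append(("rC", y))
    return moves

def solve(moves):
    """win(X) :- move(X,Y), not win(Y): label positions won, lost or drawn."""
    positions = set(moves) | {y for ys in moves.values() for y in ys}
    label, changed = {}, True
    while changed:
        changed = False
        for x in positions:
            if x in label:
                continue
            succ = moves.get(x, ())
            if any(label.get(y) == "lost" for y in succ):
                label[x], changed = "won", True
            elif all(label.get(y) == "won" for y in succ):
                label[x], changed = "lost", True   # includes dead ends
    for x in positions:
        label.setdefault(x, "drawn")
    return label

def provenance_edges(moves, label):
    """Keep only moves from won to lost or from lost to won positions."""
    return {(x, y) for x, ys in moves.items() for y in ys
            if {label[x], label[y]} == {"won", "lost"}}

MOVES = build_moves()
LABEL = solve(MOVES)
PROV = provenance_edges(MOVES, LABEL)
```

With this encoding A(a) comes out won and A(b) lost, in line with Fig. 3c, and the move from r2(b, a) to its first goal is dropped from the provenance as a bad move while the move to the goal for ¬C(a) is kept.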
3. Constraint Provenance Games
Consider the solved game graph of Fig. 3c. If the value c were added to the active domain, the provenance would be incomplete: e.g., to explain why-not A(b) there are two ∃a, ∃b branches emanating from A(b). However, with c in the active domain there is a third ∃c branch via r2(b, c): see Fig. 4. We show that a modified game construction (Fig. 5) based on constraints can be used to automatically include such extensions of the active domain, thereby eliminating the domain dependence of the original approach.
Similarly, one could conclude from Fig. 2 that the absence of 3hop(c, a) from the query answer is due entirely to the absence of hop(a, c), hop(c, a), hop(c, c), hop(c, b), and hop(b, b). This explanation, too, is complete only relative to the active domain: if d were introduced into the domain, new why-not answers such as r1(c, a, d, d) would have to be added to the provenance graph in Fig. 2. The new version of the provenance game (Fig. 9), however, takes care of this via a more general constraint node R1: X≠a, X≠b, Z1≠c, Z1≠a, Z1≠b, Z2≠c, Z2≠a, Z2≠b, Y≠c.
In constraint provenance games, nodes stand for sets of ground nodes. A constraint tuple such as “3hop(x, y): x=a, y=b” may stand for a single tuple (here: 3hop(a, b)), or for (possibly infinitely) many: e.g., “3hop(x, y): x≠a, x≠b, y=a” stands for the set { 3hop(x, a) | x ∈ D \ {a, b} } over any underlying domain D (finite or infinite).
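One way to picture a constraint tuple is as a finite description with a membership test, so that the (possibly infinite) set of ground tuples never has to be enumerated. The class below is a hypothetical sketch for illustration only; the name ConstraintTuple and its fields are assumptions and do not describe the authors' prototype.

```python
from dataclasses import dataclass, field

@dataclass
class ConstraintTuple:
    predicate: str
    equal: dict = field(default_factory=dict)      # argument position -> constant
    not_equal: dict = field(default_factory=dict)  # argument position -> excluded constants

    def covers(self, *args):
        """True if the ground tuple predicate(args) satisfies all constraints."""
        same = all(args[i] == c for i, c in self.equal.items())
        diff = all(args[i] not in cs for i, cs in self.not_equal.items())
        return same and diff

# "3hop(x, y): x != a, x != b, y = a" from the text above.
T = ConstraintTuple("3hop", equal={1: "a"}, not_equal={0: {"a", "b"}})
```

T.covers("c", "a") and T.covers("d", "a") both hold, although d is not in the active domain, while T.covers("a", "a") does not.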
Source [KLZ13]
Solve G Q(I) => Provenance!
Source [KLZ13]
Provenance ~ Query Evaluation Game
Towards Constraint Provenance Games
Sean Riddle, Sven Köhler, Bertram Ludäscher
Department of Computer Science, University of California, Davis, CA 95616
{swriddle, svkoehler, ludaesch}@ucdavis.edu
Abstract
Provenance for positive queries is well understood and elegantly handled by provenance semirings [GKT07], which subsume many earlier approaches. However, the semiring approach does not extend easily to why-not provenance or, more generally, first-order queries with negation. An alternative approach is to view query evaluation as a game between two players who argue whether, for given database I and query Q, a tuple t is in the answer Q(I) or not. For first-order logic, the resulting provenance games [KLZ13] yield a new provenance model that coincides with provenance semirings (how provenance) on positive queries, but also is applicable to first-order queries with negation, thus providing an elegant, uniform treatment of earlier approaches, including why-not provenance and negation. In order to obtain a finite answer to a why-not question, provenance games employ an active domain semantics and enumerate tuples that contribute to failed derivations, resulting in a domain dependent formalism. In this paper, we propose constraint provenance games as a means to address this issue. The key idea is to represent infinite answers (e.g., to why-not questions) by finite constraints, i.e., equalities and disequalities.
1. Introduction
Consider the relation hop(x, y) in Fig. 1a and query Q_3hop :=
r1 : 3hop(X,Y) :- hop(X,Z1), hop(Z1,Z2), hop(Z2,Y).
Q_3hop asks for pairs of nodes that are reachable via exactly three edges (“hops”). If we ask why and how a tuple such as 3hop(a, a) came about, we can use polynomials over a provenance semiring [GKT07, KG12] to get a precise answer, here: p³ + 2pqr. In Fig. 1a we see that one can “go” from node a to itself in three hops in distinct ways: (i) by using the edge p (= hop(a, a), a self-loop) three times: p·p·p, or p³ for short, (ii) by using the p edge once, followed by q (= hop(a, b)) and then r (= hop(b, a)), so p·q·r, or (iii) by following q, r, and then p, i.e., q·r·p. Since semiring provenance is commutative, p·q·r + q·r·p = 2pqr as shown in the figure. Many prior provenance approaches can be understood as special provenance semirings: e.g., Trio provenance [BSHW06], why-provenance [BKT01], and lineage [CWW00] all yield coarser versions of the provenance p³ + 2pqr of 3hop(a, a), i.e., p + 2pqr, p + pqr, and pqr, respectively [KG12].
Provenance through Games. In Fig. 1c we see that 3hop(c, a) is absent, so 3hop(c, a) is false. We cannot use semiring provenance to explain why-not, since the approach is not defined for negative queries and extensions for negation (or set-difference) are not obvious [GP10, GIT11, ADT11a, ADT11b]. On the other hand, if an approach can explain the provenance of ¬A, this naturally provides a why-not explanation for A. In [KLZ13] we proposed an alternative model of provenance that naturally supports negation. Consider the graph in Fig. 1d. It can be understood as the move graph of a query evaluation game in which two players argue whether or not
Figure 1: Each edge hop(x, y) in the input graph I in (a) is annotated (p, q, r, ...) in (b). The answer to Q_3hop is shown in (c) with provenance polynomials [KG12]. The game provenance [KLZ13], e.g., of 3hop(a, a) in (d) corresponds to the semiring provenance polynomial in (c): see (e).
(a) input I ...
(b) ... annotated. hop: (a, a) p; (a, b) q; (b, a) r; (b, c) s
(c) 3hop with provenance. 3hop: (a, a) p³ + 2pqr; (a, b) p²q + q²r; (a, c) pqs; (b, a) p²r + qr²; (b, b) pqr; (b, c) qrs
(d) The game provenance of 3hop(a, a) ...
(e) ... is p³ + 2pqr.
a tuple t ∈ Q(I). If a player wants to prove that t = 3hop(a, a) is in Q_3hop, she needs to move to a ground rule r with t in the head, thereby claiming that this rule instance is deriving t. In Fig. 1d, there are three choices, starting from the root node 3hop(a, a): the move to r1(a, a, a, a), to r1(a, a, a, b), or to r1(a, a, b, a). Here r1(x, y, z1, z2) identifies ground instances of r1. There are two ∀-quantified variables X and Y occurring in the head and body, and two (implicitly) ∃-quantified variables Z1 and Z2, occurring only in r1’s body. By moving to a ground instance of r1 in the game, the player tries to pick values for the ∃-quantified variables that make the rule body true while deriving t in the head. For r1, the middle edge hop(z1, z2) fixes the bindings of Z1 and Z2. For the given database instance I, there are three choices that “work”: (a, a), (a, b), and (b, a). This means that there are exactly three different ways to obtain 3hop(a, a) via r1 over input I: if we choose the p-hop (a, a) as the middle edge, we have p·p·p; for the q-hop (a, b) we have p·q·r; and for the r-hop (b, a) we have q·r·p.¹ The opponent can now challenge each of these claims, by selecting a subgoal
¹ Game provenance [KLZ13] can distinguish p·q·r and q·r·p and is thus even more fine-grained than the provenance semirings in [GKT07, KG12].
Provenance Game on G_Q(I) = Provenance Polynomials … for positive queries!
Source [KLZ13]
Provenance ~ Query Evaluation Game
… but also works for Why-Not provenance & non-monotonic queries (i.e., Q can have negation)!! Here: not 3hop(c,a) – can’t go back from GSLIS to CS
Figure 2: Why-not provenance for 3hop(c, a) using provenance games.
g^i_1 in the body of r1, thus claiming that g^i_1 is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that g^i_1 is true by selecting a rule instance or fact as evidence for g^i_1. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it was shown how the provenance of a tuple t can be obtained via a regular path query over a solved game graph like the one in Fig. 1d: e.g., p³ + 2pqr for 3hop(a, a) is represented by a solved game as shown in Fig. 1e: for positive queries, solved games represent semiring provenance by noting that won (green) and lost (red) positions correspond to “+” and “×” operations, respectively (leaves represent input annotations, here: p, q, r, s) [KLZ13].
Why-Not Provenance and the Many Ways to Fail. Since games are inherently symmetric (one player’s win is the opponent’s loss and vice versa), the approach yields an elegant provenance model that unifies why and why-not provenance. Consider the (dark, red) node 3hop(c, a) in Fig. 2. The color coding indicates that the position 3hop(c, a) is lost (the atom is false), i.e., all outgoing moves to a node r1(x, y, z1, z2) lead to a position that is won for the opponent. There are 9 such positions, e.g., r1(c, a, c, b) is one of them (third from the right). Recall that an instance of r1 means that one can do a 3-hop from x to y (here: c to a) via intermediate nodes z1 and z2 (here: c and b). However, in the given database I in Fig. 1(a), there is no hop(c, z), neither for z = b nor for any other z, since there are no outgoing moves from c. In this case, the opponent can successfully attack the goals in the body. Note how the why-not provenance of 3hop(c, a) in Fig. 2 is similar to but different from the why provenance of 3hop(a, a) in Fig. 1: In order to show that 3hop(c, a) is false, one has to show that all possible ways that it could be true are failing, i.e., for all z1, z2, the ground instances r1(c, a, z1, z2) do not derive 3hop(c, a) (since at least one goal in r1’s body is always false). In contrast, to prove that 3hop(a, a) is true, it is sufficient to find some ground instance r1(a, a, z1, z2) whose body is true. Earlier we saw that there are exactly three such instances, corresponding to p·p·p + p·q·r + q·r·p (= p³ + 2pqr).
Domain Dependence of Provenance Games. As seen, 3hop(a, a) has three derivations, represented by the first provenance polynomial in Fig. 1(c) and the game provenance in Fig. 1(d) and (e). How many ways are there to show that 3hop(c, a) is false (why-not provenance), or equivalently, that ¬3hop(c, a) is true? If we annotate the leaves of the game graph in Fig. 2 with identifiers u1, . . . , u5 for the five different hop tuples missing in I, we can construct a provenance expression that represents the many ways why 3hop(c, a) is not in the answer. While this answer provides a comprehensive, instance-based why-not explanation, it also exhibits a problem with the current approach: In order to obtain finite (why and why-not) provenance answers for all first-order queries, game provenance employs an active domain semantics: e.g., the provenance game for Q_3hop(I) considers only ground instances of r1 over the active domain adom(I) = {a, b, c}. If additional elements d, e, . . . are added to I (e.g., via a disconnected graph component), the why-not provenance in Fig. 2 becomes incomplete and the provenance has to be recomputed for the larger domain.
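The domain dependence can be checked by brute force. The following sketch enumerates all ground instances r1(x, y, z1, z2) over a given domain and records which body goals are missing from hop; the function name failed_instances and the set encoding of hop are illustrative assumptions.

```python
from itertools import product

# The four hop facts of the running example (annotations omitted).
HOP = {("a", "a"), ("a", "b"), ("b", "a"), ("b", "c")}

def failed_instances(x, y, domain, hop=HOP):
    """Map (z1, z2) to the missing body goals of r1(x, y, z1, z2)."""
    failed = {}
    for z1, z2 in product(sorted(domain), repeat=2):
        body = [(x, z1), (z1, z2), (z2, y)]
        missing = [goal for goal in body if goal not in hop]
        if missing:
            failed[(z1, z2)] = missing
    return failed

ADOM = {"a", "b", "c"}
WHY_NOT = failed_instances("c", "a", ADOM)
MISSING = {goal for goals in WHY_NOT.values() for goal in goals}
```

Over adom(I) all 9 instances for 3hop(c, a) fail and MISSING contains the five hop tuples named in the text. Calling failed_instances("c", "a", ADOM | {"d"}) yields 16 failing instances, so the explanation computed for the smaller domain is no longer complete.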
Constraint Provenance Games. We propose to solve the problem of domain dependence by modifying provenance games so that they can handle certain infinite relations that can be finitely represented. For example, in addition to the finitely many reasons why 3hop(c, a) fails over the active domain adom(I), there are infinitely many others, if we consider new constants d, e, . . . outside of adom(I). For example, let relation R = {a, b} have two tuples R(a) and R(b). If we want to know why-not R(c), we just point to c ∉ R. But we could also return a more general answer for why-not R(x) and say that ¬R(x) is true for all x with x ≠ a ∧ x ≠ b (not just for x = c). This approach is inspired by Chan’s Constructive Negation [Cha88], a form of constraint logic programming [Stu95]. The key idea is to represent (potentially infinite) relations through constraints, i.e., Boolean combinations of equalities x = c and disequalities x ≠ c.
Overview and Contributions. Section 2 briefly explains how first-order queries are translated into games and how provenance is extracted from solved games. In Section 3 we describe the construction of constraint provenance games; additional details and examples are contained in the appendix. Our main contributions are: (i) game provenance provides a uniform treatment of why and why-not provenance for first-order logic (= relational algebra with set-difference); (ii) for positive queries, the approach captures the most informative semiring provenance [GKT07, KG12]; (iii) we develop a constraint provenance framework which yields domain independent provenance expressions, extending prior results [KLZ13]; and (iv) we implemented a prototype of constraint provenance games.
2. Provenance through Games
We first sketch how a query Q over database I gives rise to a game G_Q(I) and how to obtain provenance from the solved game G^γ_Q(I). Consider, e.g., input relations B(X,Y) and C(Y) and a relational query Q_ABC with set-difference: A := π_X(B ⋈ (π_Y(B) \ C)). It is well-known that any relational algebra query can be translated into a non-recursive Datalog¬ program. Here, we have Q_ABC =
r2 : A(X) :- B(X,Y), ¬C(Y).
The key idea of provenance games is to understand query evaluation as a game between players I and II who argue whether or not a tuple is in the answer. In [KLZ13] we showed that the solved game is a representation of why (why-not) provenance of answer tuples (missing tuples), respectively. Fig. 3a shows the game template for Q_ABC: to prove that A(x) is true, player I needs to find a rule instance of r2, say A(x) :- B(x, y), ¬C(y), which derives the desired tuple A(x) and whose choice y for the ∃-quantified variable Y in the body satisfies all literals (subgoals) in the rule body. In the game template in Fig. 3a this corresponds to a move from A(X) to r2(X,Y) while choosing a suitable domain value y for the ∃-quantified variable Y. Player II can challenge this claim by “attacking” one of the subgoals g in the rule body. If player I chose the “wrong” y for the instance r2(x, y), then II can always attack at least one subgoal that falsifies the body. The game continues in turns, until a player cannot move and loses, and the opponent wins.
A game template G_Q for query Q contains literal nodes (oval; for atoms or their negation), rule nodes (boxes; for Datalog¬ rules), and goal nodes (rounded boxes; subgoals of rules): see Fig. 3a. Edge labels indicate a condition for a move: e.g., the label “∃Y” between a literal node, say A(X), and a rule node, say r2(X,Y), requires a player to pick a value y for the ∃-quantified variable Y when moving from an atom to the rule that derives it. Similarly, a condition “X:=Y” means that the current choice of Y becomes
Source [KLZ13]
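The game reading of r2 can be tried out on a toy instance. The following is a minimal sketch, not the construction from [KLZ13]: the node names (`goal_B`, `notB`, `fact_B`, ...), the sample relations `B` and `C`, and the helper functions are assumptions made for illustration. A position is won if the player to move can reach a position that is lost for the opponent; a position without moves is lost.

```python
from itertools import product

# Toy database instance I (assumed for illustration).
B = {("a", "b"), ("a", "c"), ("d", "c")}
C = {"c"}
domain = sorted({v for t in B for v in t} | C)

def build_game():
    """Return the move graph: position -> list of successor positions."""
    moves = {}
    for x in domain:
        # Claim A(x) by picking a witness y for the existentially quantified Y.
        moves[("A", x)] = [("r2", x, y) for y in domain]
    for x, y in product(domain, repeat=2):
        # The opponent attacks one of the two subgoals of the rule instance.
        moves[("r2", x, y)] = [("goal_B", x, y), ("goal_notC", y)]
        # Positive subgoal: goes through the negated literal to the atom.
        moves[("goal_B", x, y)] = [("notB", x, y)]
        moves[("notB", x, y)] = [("B", x, y)]
        # An atom has a move to a terminal fact node only if the fact is in I.
        moves[("B", x, y)] = [("fact_B", x, y)] if (x, y) in B else []
        moves[("fact_B", x, y)] = []
    for y in domain:
        # Negative subgoal: moves directly to the atom C(y).
        moves[("goal_notC", y)] = [("C", y)]
        moves[("C", y)] = [("fact_C", y)] if y in C else []
        moves[("fact_C", y)] = []
    return moves

def solve(moves):
    """Label every position True (won) or False (lost); the graph is acyclic."""
    label = {}
    def won(pos):
        if pos not in label:
            label[pos] = any(not won(nxt) for nxt in moves[pos])
        return label[pos]
    for pos in moves:
        won(pos)
    return label

moves = build_game()
label = solve(moves)

# Answer tuples are the won A(x) positions.
answers = {x for x in domain if label[("A", x)]}

def witnesses(x):
    """Winning moves out of A(x): the y values that justify the answer."""
    return [y for (_, _, y) in moves[("A", x)] if not label[("r2", x, y)]]
```

Here `answers` contains only `a`, and `witnesses("a")` returns the choice `b`: the move to r2(a, c) fails because C(c) holds. The lost positions A(b), A(c), and A(d) carry the why-not side of the story in the same labeled graph.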
Database Provenance: Summary
• Fine-grained “white-box” provenance
• Solved (pretty much) for positive queries
• … not so much for negation and “Why-Not”
– Active area of research!
• Some research prototypes …
• … and some real-world implementations!
• Note: Those in need of provenance often already “do it”!!
– Crash recovery, auditing, concurrency control, …
Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• Provenance in Scientific Workflows
• YesWorkflow: Doing more (sometimes with less)
Scientific Workflows: ASAP!
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
=> traceable data- and wf-evolution
=> Reproducible Science
[Figure: example workflow systems, including Trident Workbench and VisTrails]
Once upon a time …
Phylogenetics workflow in Kepler (2005)
Graphical interface
§ Canvas for assembling and displaying the workflow.
§ Library of workflow blocks (‘actors’) that can be dragged onto the canvas and connected.
§ Arrows that represent control dependencies or paths of data flow.
§ A run button.
These features are not essential to managing actual scientific workflows.
What some of us think of when we hear the term ‘scientific workflows’
Source: Tim McPhillips
10 Key Functions of a Sci-WFS
1. Automate programs and services scientists already use.
2. Schedule invocations of programs and services correctly and efficiently – in parallel where possible.
3. Manage dataflow to, from, and between programs and services.
4. Enable scientists (not just developers) to author or modify workflows easily.
5. Predict what a workflow will do when executed: prospective provenance.
6. Record what actually happens during workflow execution.
7. Reveal retrospective provenance – how workflow products were derived from inputs via programs and services.
8. Organize intermediate and final data products as desired by users.
9. Enable scientists to version, share and publish their workflows.
10. Empower scientists who wish to automate additional programs and services themselves.
These functions, not actors, distinguish scientific workflow automation from general scientific software development.
Tim McPhillips et al.
Yes, scripts are (can be) workflows too!
Interactive Visualization
SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
… HPCBio Workflows @ Illinois
National Petascale Computing Facility
Broad Institute: Recommended workflow for variant analysis
Liudmila Mainzer, Victor Jongeneel HPC Bio @ Illinois
Quickly, say: #!/bin/bash
It’s time to shift control …
• … back from being consumers of someone else’s (= our) tools ..
– “Just click here!”
• ... to tool makers!
– Scientists who author workflows as scripts!
• Go where the wild things (users!) are …
– Yes, develop for “end users” …
– … but don’t forget the tool makers!
• Can we do this together?
Example: AutoDrug Workflow
[Figure: AutoDrug workflow. Steps: Mount Sample, Screen Sample, Align Sample, Expose Sample, Analyze Images, Check Criteria, Calculate Strategy, Collect Data Set, Calculate Maps, List Peaks, Run Search, Refine Structure, Integrate Images, Scale Reflections, Merge Reflections, Calc Amplitudes. Phases: Collect Data, Process Data, Solve Structure, Analyze Density. Tools: Blu-Ice, LABELIT, molrep, refmac, ipmosflm, xds, pointless, scala, xtriage, truncate, rfree.]
Tsai, Y., McPhillips, S. E., González, A., McPhillips, T. M., Zinn, D., Cohen, A. E., ... & Soltis, S. (2013). AutoDrug: fully automated macromolecular crystallography workflows for fragment-based drug discovery. Acta Crystallographica Section D: Biological Crystallography, 69(5), 796-803.
3D Protein Structure Determination by X-ray Crystallography
[Figure: diffraction images; experimental electron density and protein model; full protein structure]
Source: Tim McPhillips
Automated Sample Handling
[Figure: crystal in loop; sample mounting robot; cassette shipping dewar; crystal mounting pin; sample cassette]
Alice, the high-throughput crystallographer: When the first shift of her beam time begins, technicians at the beam line load the three cassettes into a liquid nitrogen dewar within reach of the sample-mounting robot and close the radiation door. From this point Alice is able to control beam line operations remotely.
Source: Tim McPhillips
Remote beam line operation
Source: Tim McPhillips
Outline
• All things “Provenance” …
• Provenance: Why should you care?
• Provenance in Databases
– Why-, How-, …, Why-Not Provenance
• Provenance in Scientific Workflows
• YesWorkflow: Doing more with Provenance!
– … sometimes using less (e.g., no provenance recorder)
YesWorkflow: Yes, scripts are workflows, too!
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
[Figure: workflow graph marked with a “?”; node labels include GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output]
Enter: YesWorkflow! (yesworkflow.org)
• YesWorkflow (YW)
– Grass-roots effort
– … meeting the scientists/users where they R!
• R, Matlab, (i)Python, Jupyter, …
– Scripts + simple user annotations
• => Reveal the workflow model/abstraction … that underlies the (script) implementation
• => YW can give us more of ASAP!
– First YW: ASAP (Abstraction) ...
– Then YW-recon: ASAP (reconstructing runtime Provenance)
Related Work, other Approaches … to bring workflow/provenance benefits to scripts:
• Runtime Provenance Recorders:
– use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables
• file read/write, function calls, program variables & state, …
– noWorkflow system
• [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14]
• exploit Python profiling library to capture runtime provenance
=> helps with "S" and "P"
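To make the recorder idea concrete, here is a minimal sketch that uses Python's standard profiling hook, `sys.setprofile`, to log which Python functions a script calls. It is only an illustration of the general approach; it is not how noWorkflow is implemented, and `record_calls`, `normalize`, and `analyze` are hypothetical names.

```python
import sys

def record_calls(func, *args, **kwargs):
    """Run func and return (result, names of Python functions called)."""
    calls = []

    def profiler(frame, event, arg):
        # "call" fires for Python-level function calls; C calls use "c_call".
        if event == "call":
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always switch the recorder off again
    return result, calls

def normalize(values):
    total = sum(values)
    scaled = []
    for v in values:
        scaled.append(v / total)
    return scaled

def analyze(values):
    return max(normalize(values))

result, calls = record_calls(analyze, [1, 3])
```

After the run, `calls` lists `analyze` followed by `normalize`. A real recorder would also capture arguments, file accesses, and timestamps, which is where the cost and the detail of runtime provenance come from.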
YW (prospective) and YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN..@END, @IN, @OUT
– Visualize, share, be happy :-)
• 2. Run script
– Files are read and written
– Folder- & Filenames have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model <=> Persisted Data
– Run URI-template queries
• cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries
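Step 1 can be pictured with a small annotated script. The sketch below is a hedged illustration: the tags @BEGIN, @END, @IN, @OUT, and @URI and the two URI templates come from the slides, but the exact comment layout, the wiring of `raw_image` into `transform_images`, and the toy `extract_model` function are assumptions. The extractor is not the YesWorkflow tool.

```python
import re

# A script fragment with YW-style annotations in ordinary comments.
SCRIPT = '''\
# @BEGIN transform_images
# @IN raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
# @OUT corrected_image @URI file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
def transform_images(raw_path, corrected_path):
    pass  # image correction would go here
# @END transform_images
'''

# One annotation per comment line, with an optional @URI qualifier.
TAG = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\S+)(?:\s+@URI\s+(\S+))?")

def extract_model(source):
    """Collect blocks with their inputs and outputs from annotation comments."""
    model, current = {}, None
    for line in source.splitlines():
        m = TAG.match(line.strip())
        if not m:
            continue  # ordinary code or comment, not an annotation
        tag, name, uri = m.groups()
        if tag == "BEGIN":
            current = model.setdefault(name, {"in": {}, "out": {}})
        elif tag == "END":
            current = None
        elif current is not None:
            current[tag.lower()][name] = uri
    return model

model = extract_model(SCRIPT)
```

The resulting `model` maps each block to its named inputs and outputs plus their URI templates, which is the information a workflow graph needs. Because the annotations live in comments, the script itself runs unchanged.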
YW annotations: Model your Workflow!
YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!
• YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
[Figure: YW! The workflow graph extracted from the annotated script. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data and URI templates:
cassette_id
sample_score_cutoff
sample_spreadsheet file:cassette_{cassette_id}_spreadsheet.csv
calibration_image file:calibration.img
run_log file:run/run_log.txt
sample_name, sample_quality
rejected_sample, accepted_sample, num_images, energies
rejection_log file:/run/rejected_samples.txt
sample_id, energy, frame_number
raw_image file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
corrected_image file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
total_intensity, pixel_count, corrected_image_path
collection_log file:run/collected_images.csv]
Voila! The Workflow revealed!
Get 3 views for the price of 1!
[Figure: the same workflow in three views: process view, data view, combined view]
Paleoclimate Reconstruction (EnviRecon.org)
• … explained using YesWorkflow!
[Figure: YW graph of the paleoclimate reconstruction script. Node labels: GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years]
Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
Provenance Lands
Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)
run/
├── raw
│ └── q55
│ ├── DRT240
│ │ ├── e10000
│ │ │ ├── image_001.raw
... ... ... ...
│ │ │ └── image_037.raw
│ │ └── e11000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_037.raw
│ └── DRT322
│ ├── e10000
│ │ ├── image_001.raw
... ... ...
│ │ └── image_030.raw
│ └── e11000
│ ├── image_001.raw
... ...
│ └── image_030.raw
├── data
│ ├── DRT240
│ │ ├── DRT240_10000eV_001.img
... ... ...
│ │ └── DRT240_11000eV_037.img
│ └── DRT322
│ ├── DRT322_10000eV_001.img
... ...
│ └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt
YW-RECON: Prospective & Retrospective Provenance … (almost) for free!
• URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
• … facilitating provenance reconstruction
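The link between a URI template and the files left by a run can be sketched as a template-to-regex conversion. This is an illustration under assumptions, not YW-Recon's implementation: `template_to_regex` is a hypothetical helper, and each `{variable}` is assumed to match within a single path segment. The templates and the two sample paths are taken from the slides.

```python
import re

def template_to_regex(template):
    """Compile a URI template with {name} placeholders into a regex."""
    # re.split with a capturing group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    seen, pattern = set(), ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal text
        elif part in seen:
            pattern += f"(?P={part})"         # repeated variable: same value again
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+)"   # new variable: stays inside one segment
    return re.compile(pattern + r"\Z")

RAW = template_to_regex(
    "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw")
CORRECTED = template_to_regex(
    "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img")

raw = RAW.match("run/raw/q55/DRT240/e10000/image_001.raw").groupdict()
corrected = CORRECTED.match("data/DRT322/DRT322_11000eV_030.img").groupdict()
```

Matching a path against its template recovers the variable bindings, for example cassette `q55`, sample `DRT240`, energy `10000`, and frame `001` for the raw image. Those bindings are the reconstructed runtime metadata; no recorder had to run alongside the script.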
Data collection workflow (X-ray diffraction)
[Figure: the data collection workflow graph shown above]
Data collection workflow: runtime data
1. YW annotations => YW model
2. Files & Folders left by a run => runtime (meta-)data
[Figure: the workflow graph next to the run/ directory tree shown above]
Q1: What samples did the script run collect images from?
Q2: What energies were used for image collection from sample DRT322?
Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?
Querying Provenance
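Questions such as Q1, Q2, Q3, and Q5 can be answered from file paths alone. The sketch below assumes a handful of raw-image paths copied from the run/ listing (the listing elides most frames) and assumes that a corrected image and its raw image share the same sample_id, energy, and frame_number, as the two URI templates suggest. The regexes and helper names are hypothetical, not YW-Query's interface.

```python
import re

# Raw-image paths that appear explicitly in the run/ listing.
PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw")
IMG = re.compile(
    r"(?P<sample_id>[^/_]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img")

# One fact per raw image: the template variables plus the path itself.
raw_facts = []
for p in PATHS:
    m = RAW.fullmatch(p)
    if m:
        raw_facts.append({**m.groupdict(), "path": p})

def raw_for(corrected_name):
    """Raw-image facts that share all variables with a corrected image name."""
    key = IMG.fullmatch(corrected_name).groupdict()
    return [f for f in raw_facts if all(f[k] == v for k, v in key.items())]

q1 = sorted({f["sample_id"] for f in raw_facts})
q2 = sorted({f["energy"] for f in raw_facts if f["sample_id"] == "DRT322"})
q3 = [f["path"] for f in raw_for("DRT322_11000eV_030.img")]
q5 = sorted({f["cassette_id"] for f in raw_for("DRT240_10000eV_001.img")})
```

With these paths, `q1` gives the samples DRT240 and DRT322, `q2` the energies 10000 and 11000, `q3` the path run/raw/q55/DRT322/e11000/image_030.raw, and `q5` the cassette q55. The same facts could equally be loaded into a database or a Datalog engine and queried there.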
Taking YW for a spin …
• “To document on-the-fly, specifically for a given workflow configuration invoked:
– do not insert annotations into code,
– but rather have code print annotations into a special log during execution,
– then parse that log!”
– Liudmila Mainzer
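One way to read this suggestion is sketched below: the code writes annotation lines into a log while it runs, and a separate pass parses that log afterwards. The log line format, `run_workflow`, and `parse_log` are assumptions for illustration; the quote does not prescribe a format.

```python
import io
import re

def run_workflow(log, cassette_id, sample_id, energy):
    """Toy step that prints annotations into a log while it executes."""
    log.write("@BEGIN collect_data_set\n")
    log.write(f"@IN cassette_id {cassette_id}\n")
    log.write(f"@IN sample_id {sample_id}\n")
    log.write(f"@IN energy {energy}\n")
    log.write(
        f"@OUT raw_image run/raw/{cassette_id}/{sample_id}/e{energy}/image_001.raw\n")
    log.write("@END collect_data_set\n")

# Assumed log format: "@TAG name [value]" on each line.
LINE = re.compile(r"@(BEGIN|END|IN|OUT)\s+(\S+)(?:\s+(\S+))?")

def parse_log(text):
    """Turn the annotation log into a list of executed steps."""
    steps, current = [], None
    for line in text.splitlines():
        m = LINE.fullmatch(line.strip())
        if not m:
            continue
        tag, name, value = m.groups()
        if tag == "BEGIN":
            current = {"step": name, "in": {}, "out": {}}
        elif tag == "END":
            steps.append(current)
            current = None
        elif current is not None:
            current[tag.lower()][name] = value
    return steps

log = io.StringIO()  # stands in for the special log file
run_workflow(log, "q55", "DRT322", 11000)
steps = parse_log(log.getvalue())
```

Because the annotations are emitted at run time, they describe the configuration that was actually invoked, including concrete values, rather than the static structure of the script.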
Source: L Mainzer, V Jongeneel (IGB & NCSA)
Conclusions
• Provenance
– … in databases
– … in scientific workflows
• Scripts are (often) workflows too!
• => Need to support provenance management for scripts and scientific workflows!
• One size might not fit all …
– Use prospective, retrospective (recorded, reconstructed provenance)
• Facilitate “insider” (or “deep”) provenance
– … the stuff scientists need to get their job done!
Deep Provenance to get the science done!
• When reconstructing the past climate, need to know which tree-ring source was used!
[Figure: map with state labels Arizona, Colorado, New Mexico, Utah; site labels CRTZ, MVNP, ESPN, LANL; legend: Douglas fir; Pinyon and juniper; Spruce, pine, and true fir; GHCN stations]
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
Conclusions (Cont’d)
• YesWorkflow: Go where the users are!
– … they already capture provenance through metadata!
• Beware your level of provenance abstraction
– Let the user provide a workflow model easily!
• YW-Recon:
– … finishing support for retrospective provenance without using a runtime provenance recorder!
– Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
• Future Work:
– Build systems that work with the existing workflow of scientists!
– There are many research questions & opportunities out there!
• e.g.: Why-Not provenance for scientific workflows anyone?