Page 1: Works 2015-provenance-mileage

YesWorkflow: More Provenance Mileage from Scientific Workflows and Scripts!

Bertram Ludäscher
Director, Center for Informatics Research in Science and Scholarship (CIRSS)
Professor, Graduate School of Library and Information Science (GSLIS)
Faculty affiliate, NCSA & Department of Computer Science

Page 2: Works 2015-provenance-mileage

Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
   –  Why-, How-, …, Why-Not Provenance
•  … vs Provenance in Scientific Workflows
•  YesWorkflow: Doing more (sometimes with less)

More  Provenance  Mileage  from  Workflows  and  Scripts   2  

Page 3: Works 2015-provenance-mileage

Provenance Palooza
•  Provenance
   –  … or provenience?
•  Chain of custody
•  Lineage
•  Pedigree
•  Genealogy
•  Phylogeny
•  History
•  Origin

Page 4: Works 2015-provenance-mileage

Provenance  Research  everywhere  …    …  and  here:  


Page 5: Works 2015-provenance-mileage

Provenance as we all know it
•  Oxford English Dictionary:
   –  coming from some particular source or quarter; origin, derivation
   –  the history or pedigree of a work of art, manuscript, rare book, etc.
   –  concretely, a record of the passage of an item through its various owners (“chain of custody”)
•  Merriam-Webster:
   –  prov·e·nance noun \ˈpräv-nən(t)s, ˈprä-və-ˌnän(t)s\
   –  the origin or source of something
•  Origin:
   –  French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come

Page 6: Works 2015-provenance-mileage

Provenance

Is this a real Leonardo? Lack of reliable provenance casts doubt on this …

Page 7: Works 2015-provenance-mileage

Pedigree


Page 8: Works 2015-provenance-mileage

Natural History: Understanding what happened…

Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.

Author: Jkwchui (Based on drawing by Truth-seeker2004)

Page 9: Works 2015-provenance-mileage

Provenience  vs  Provenance  


Page 10: Works 2015-provenance-mileage

Archivists

Society of American Archivists: http://www2.archivists.org/glossary/terms/p/provenance

•  Principle of provenance (respect des fonds)
•  Keep records of different origins separate to preserve context

Page 11: Works 2015-provenance-mileage

So what is “provenance” (sensu W3C)?

•  Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
•  Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
•  Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing in the world

Page 12: Works 2015-provenance-mileage

Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
   –  Why-, How-, …, Why-Not Provenance
•  Provenance in Scientific Workflows
•  YesWorkflow: Doing more (sometimes with less)

Page 13: Works 2015-provenance-mileage

Provenance => Transparency
•  = “Externally-facing” provenance
   –  “Them-Provenance”
•  Later: “Internally-facing” provenance
   –  “Me-Provenance”

Page 14: Works 2015-provenance-mileage

Climate  Change:  Whodunnit?  


Page 15: Works 2015-provenance-mileage

Tracing  the  sources  (data,  code)    


Page 16: Works 2015-provenance-mileage

From “Climate Gate” to Reproducible Science


Page 17: Works 2015-provenance-mileage

Data & Provenance Management: Single Model


Page 18: Works 2015-provenance-mileage

Data & Provenance Management: Model Chains


Page 19: Works 2015-provenance-mileage

Some things people do with “provenance”

•  Result validation
•  Result debugging (science vs wf logic)
•  Reproducibility and Repeatability
•  Explanation (derivations, traces, proof trees)
•  Runtime monitoring
   –  Profiling, benchmarking
•  Performance Optimization (“smart re-run”)
•  Fault-tolerance, crash-recovery
•  Database view maintenance (e.g. data warehousing)
•  …

Page 20: Works 2015-provenance-mileage

Provenance for Virtual Joint Experiments

•  How do we ensure that Charlie gets a complete account of the history of WC’s outputs?
•  How do we ensure that Alice gets her due (partial) credit when Charlie uses Bob’s data v?
   → traces TA and TB will be critical
   → need to compose them to obtain TC

We can view the composition WC as a new, virtual workflow.

[Figure: Alice (1) develops WA and (2) runs RA; Bob (3) develops WB and (5) runs RB; (4) data sharing; Charlie (6) inspects provenance and (7) understands and generates WC. Data items x, z, u, v; traces TA and TB.]
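The idea of composing the traces TA and TB into TC can be sketched as follows. This is only an illustration: the data item names x, z, u, v come from the figure labels, but the dependencies between them (z derived from x, u obtained from z through sharing, v derived from u) and the helper names `compose` and `lineage` are assumptions, not taken from the slides.

```python
# A trace is modeled as a set of (output, input) pairs: "output was derived from input".
# The concrete dependencies below are hypothetical.
T_A = {("z", "x")}   # Alice's run R_A
T_S = {("u", "z")}   # data sharing step between Alice and Bob
T_B = {("v", "u")}   # Bob's run R_B

def compose(*traces):
    """Compose traces by taking the union of their dependency edges."""
    return set().union(*traces)

def lineage(item, trace):
    """Return every item that `item` transitively depends on in `trace`."""
    seen, stack = set(), [item]
    while stack:
        current = stack.pop()
        for output, source in trace:
            if output == current and source not in seen:
                seen.add(source)
                stack.append(source)
    return seen

T_C = compose(T_A, T_S, T_B)
print(sorted(lineage("v", T_B)))  # Bob's trace alone stops at u
print(sorted(lineage("v", T_C)))  # the composed trace reaches Alice's input x
```

With only TB, Charlie sees that v came from u; only the composed trace credits Alice's data.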

Page 21: Works 2015-provenance-mileage

Open  Provenance  Model  =>  W3C  Prov  


Page 22: Works 2015-provenance-mileage

W3C  Prov:  One  size  fits  all?  


Page 23: Works 2015-provenance-mileage

Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
   –  Why-, How-, …, Why-Not Provenance
•  Provenance in Scientific Workflows
•  YesWorkflow: Doing more (sometimes with less)

Page 24: Works 2015-provenance-mileage

Types of Data Provenance
•  Black-box
   –  know (next to) nothing at compile-time
   –  at runtime: keep some data lineage
   –  most prov sensu WF work uses this
•  White-box
   –  statically (compile-time) analyzable
   –  q(Y1,Y2) :- p(X1,X2), r(X1,Y1), s(X2,Y2)
   –  most prov sensu DB work uses this
•  Grey-box
   –  can “look inside” (some black boxes)
   –  … e.g. b/c they have subworkflows
   –  … or FP signatures: A :: t1, t2 → t3, t4
   –  … or semantic annotations (sem. types)

[Figure: a black box f; a grey box A with inputs t1, t2 and outputs t3, t4; a white box q with inputs X1, X2 and outputs Y1, Y2.]

Page 25: Works 2015-provenance-mileage

Provenance  in  Databases  


Source:  Val  Tannen  

Page 26: Works 2015-provenance-mileage

Provenance  in  Databases  


Source:  Val  Tannen  

Page 27: Works 2015-provenance-mileage

Provenance  in  Databases  


Source:  Val  Tannen  

Page 28: Works 2015-provenance-mileage

Abstracting the structure of querying

Source: Val Tannen

In database provenance, tuples are either combined conjunctively (*) or disjunctively (+) → That’s the core model!

Page 29: Works 2015-provenance-mileage

Provenance Polynomials
One Semiring to Rule them all! (DB theory strikes!)

Green, Karvounarakis, Tannen. Provenance semirings, PODS, 2007

Unifying most prior work in a simple model!

Page 30: Works 2015-provenance-mileage

Example: Go from X to Y in 3 hops! (e.g., a = CS, b = NCSA, c = GSLIS)

•  Database: hop(X,Y) :=
   hop(a,a, p). hop(a,b, q). hop(b,a, r). hop(b,c, s).
•  Query: 3hop(X,Y) :- hop(X, Z1), hop(Z1, Z2), hop(Z2, Y).
•  Answer: 3hop(a,a, p^3 + 2pqr). 3hop(a,b, p^2q + q^2r). … 3hop(a,c, pqs).

Note: Cannot go from c to a in 3 hops!

[Figure: the hop graph with edges a→a (p), a→b (q), b→a (r), b→c (s), and the 3hop graph with edges a→a (ppp + pqr + qrp), a→b (ppq + qrq), a→c (pqs), b→a (ppr + qrr), b→b (rpq), b→c (rqs).]
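The 3hop example can be reproduced with a small sketch. It assumes a polynomial is stored as a mapping from monomials (sorted tuples of annotation variables) to integer coefficients; the function names `three_hop` and `show` are illustrative only and do not come from the slides.

```python
from collections import Counter
from itertools import product

# Annotated input from the slide: hop(x, y) tagged with p, q, r, s.
HOP = {("a", "a"): "p", ("a", "b"): "q", ("b", "a"): "r", ("b", "c"): "s"}

def three_hop(hop):
    """Map each derivable 3hop(x, y) to its polynomial {monomial: coefficient}."""
    answers = {}
    for (x, z1), (m1, z2), (m2, y) in product(hop, repeat=3):
        if z1 == m1 and z2 == m2:  # the three hops must chain
            # Joint use multiplies annotations; sorting makes the product commutative.
            monomial = tuple(sorted((hop[(x, z1)], hop[(m1, z2)], hop[(m2, y)])))
            # Alternative derivations of the same tuple add up.
            answers.setdefault((x, y), Counter())[monomial] += 1
    return answers

def show(poly):
    """Render a polynomial such as p^3 + 2pqr."""
    terms = []
    for monomial, coeff in sorted(poly.items()):
        powers = sorted(Counter(monomial).items())
        body = "".join(v if e == 1 else f"{v}^{e}" for v, e in powers)
        terms.append(body if coeff == 1 else f"{coeff}{body}")
    return " + ".join(terms)

answers = three_hop(HOP)
print(show(answers[("a", "a")]))  # p^3 + 2pqr
print(("c", "a") in answers)      # False: no 3-hop path from c to a
```

Multiplication records tuples used together in one derivation, and addition records alternative derivations, which is the conjunctive/disjunctive reading of the core model.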

Page 31: Works 2015-provenance-mileage

Provenance Polynomials

“Mein Schatz!” (“My precious!”)

[Figure: hierarchy of provenance expressions for 3hop(a,a), from most to least informative: p^3 + 2pqr; p^3 + pqr and p + 2pqr; p + pqr; pqr; p + pqr; p. The 3hop graph of the previous slide is shown again.]
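How the coarser expressions arise from p^3 + 2pqr can be sketched by dropping information step by step. The names given in the comments (Trio provenance, why-provenance, lineage) follow the paper excerpt reproduced later in this deck, which gives p + 2pqr, p + pqr, and pqr for them; the function names and the dictionary representation are assumptions made for illustration.

```python
from collections import Counter

# p^3 + 2pqr, stored as {monomial (sorted tuple of variables): coefficient}.
POLY = Counter({("p", "p", "p"): 1, ("p", "q", "r"): 2})

def drop_exponents(poly):
    """Forget how often a tuple is used within one derivation (p^3 becomes p)."""
    out = Counter()
    for monomial, coeff in poly.items():
        out[tuple(sorted(set(monomial)))] += coeff
    return out

def drop_coefficients(poly):
    """Forget how many derivations share the same set of tuples (2pqr becomes pqr)."""
    return Counter({monomial: 1 for monomial in poly})

def lineage(poly):
    """Keep only the set of input tuples that contribute at all."""
    return {variable for monomial in poly for variable in monomial}

trio_like = drop_exponents(POLY)            # p + 2pqr
why_like = drop_coefficients(trio_like)     # p + pqr
print(dict(trio_like))
print(dict(why_like))
print(sorted(lineage(POLY)))                # the tuples p, q, r
```

Each step is a lossy mapping, so the polynomial at the top of the hierarchy determines all expressions below it but not the other way around.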

Page 32: Works 2015-provenance-mileage


Provenance in Databases

Page 33: Works 2015-provenance-mileage

Negation & Why-Not Provenance
•  Provenance Semirings work well for:
   –  Positive Queries (e.g., RA+)
•  Challenges: Handling of
   –  set difference (~ negation)
   –  Why-not provenance
   –  Missing Answer provenance
•  A fresh look at provenance!
•  … using an old idea: Game semantics!

Page 34: Works 2015-provenance-mileage

Provenance (or Query Evaluation) Games

“SLD-resolution game”: A(X) :- B(X,Y,Z) … not C(X,Y) …

Eureka! [KLZ13] Köhler, S., Ludäscher, B., & Zinn, D. (2013). First-order provenance games. In Search of Elegance in the Theory and Practice of Computation. Springer.

Page 35: Works 2015-provenance-mileage

Translation: Q(I) => G_Q(I)

[Figure 3: Provenance game for Q_ABC. (a) Game template for Q_ABC: A(X) :- B(X,Y), ¬C(Y). (b) Instantiated Q_ABC game on I = {B(a, b), B(b, a), C(a)}. (c) Solved game: lost positions are (dark) red; won positions are (light) green. Provenance edges (= good moves) are solid. Bad moves are dashed and not part of the provenance. A(a) is true (A(b) is false) as it is won (lost) in the solved game; the game provenance explains why (why-not). The well-founded model of win(X) :- M(X,Y), ¬win(Y), applied to move graph M, solves the game.]

… the new binding for X; a condition “B(X,Y)” means that a move is possible only if B(X,Y) is true in I for the current X, Y values.²

Given database I, a template can be instantiated yielding a game graph G_Q(I) as in Fig. 3b. Note how template variables (e.g., Y) have been replaced by domain values (a or b), and that conditional edges (e.g., labeled “C(X)”) became unconditional edges (e.g., C(a) → r_C(a)) or no edge at all (e.g., from C(b)), depending on whether or not the condition holds in I. To extract why(-not) provenance from a game graph G_Q(I) as in Fig. 3b, we need to solve the game first, i.e., determine which positions are won (light green) or lost (dark red); see Fig. 3c. There is a surprisingly simple and elegant solution: the (unstratified) Datalog¬ rule Q_wm :=

win(X) :- move(X,Y), ¬win(Y)

when evaluated under the well-founded semantics [VGRS91] solves the game! Thus we can use Q_wm as a “game engine” to solve the provenance game with a move relation given by G_Q(I).³

Finally, the solved game is a labeled graph, i.e., each node carries a new label indicating whether a position is won (light green) or lost (dark red). As shown in [KLZ13], only edges from won to lost nodes (green) and lost to won nodes (red) are part of the provenance; other edges (grey, dashed in Fig. 3c) correspond to “bad moves” (invalid arguments in the query evaluation game) and are excluded from the provenance. The provenance subgraph of the solved game thus consists only of edges that are matched by the regular path queries (g.r)+ and r.(g.r)*, i.e., alternating sequences of green (winning) and red (delaying) moves [KLZ13].

² Readers familiar with logic programming semantics may recognize that provenance games mimic a form of SLD(NF) resolution.
³ Indeed, both our prototypes use Q_wm to compute (constraint) provenance.

[Figure 4: Altered subgraph of Fig. 3c after adding c to the active domain.]

[Figure 5: Constraint provenance game for Q_ABC. Unlike in Figure 3, nodes may represent finite or infinite sets here.]

3. Constraint Provenance Games

Consider the solved game graph of Fig. 3c. If the value c were added to the active domain, the provenance would be incomplete: e.g., to explain why-not A(b) there are two ∃a, ∃b branches emanating from A(b). However, with c in the active domain there is a third ∃c branch via r2(b, c): see Fig. 4. We show that a modified game construction (Fig. 5) based on constraints can be used to automatically include such extensions of the active domain, thereby eliminating the domain dependence of the original approach.

Similarly, one could conclude from Fig. 2 that the absence of 3hop(c, a) from the query answer is due entirely to the absence of hop(a, c), hop(c, a), hop(c, c), hop(c, b), and hop(b, b). Also this explanation, however, is complete only relative to the active domain: if d was introduced into the domain, new why-not answers such as r1(c, a, d, d) would have to be added to the provenance graph in Fig. 2. The new version of the provenance game (Fig. 9), however, takes care of this via a more general constraint node R1: X ≠ a, X ≠ b, Z1 ≠ c, Z1 ≠ a, Z1 ≠ b, Z2 ≠ c, Z2 ≠ a, Z2 ≠ b, Y ≠ c.

In constraint provenance games, nodes stand for sets of ground nodes. A constraint tuple such as “3hop(x, y): x = a, y = b” may stand for a single tuple (here: 3hop(a, b)), or for (possibly infinitely) many: e.g., “3hop(x, y): x ≠ a, x ≠ b, y = a” stands for the set { 3hop(x, a) | x ∈ D \ {a, b} } over any underlying domain D (finite or infinite).

Source [KLZ13]
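The instantiation step Q(I) => G_Q(I) for Q_ABC can be sketched as below. The edge structure (relation node to rule node, rule node to goal nodes, goal nodes to negated or plain atoms, atoms to fact nodes when the fact is in I) is a reading of the game template in Fig. 3a and should be treated as an assumption; node names are plain strings, with `~` standing for ¬ and `g12`, `g22` for the first and second goal of rule r2.

```python
from itertools import product

# Instance from Fig. 3b; the domain {a, b} is the active domain of I.
DOMAIN = ["a", "b"]
B = {("a", "b"), ("b", "a")}
C = {"a"}

def instantiate(domain, b_facts, c_facts):
    """Ground the game template for A(X) :- B(X,Y), ¬C(Y) into a list of moves."""
    moves = []
    for x in domain:
        moves.append((f"~A({x})", f"A({x})"))
        moves.append((f"~C({x})", f"C({x})"))
        moves.append((f"g22({x})", f"C({x})"))        # goal ¬C(Y) is attacked via C(Y)
        if x in c_facts:                               # conditional edge: only if C(x) is in I
            moves.append((f"C({x})", f"r_C({x})"))
    for x, y in product(domain, repeat=2):
        moves.append((f"A({x})", f"r2({x},{y})"))      # choose a value y for the ∃-variable Y
        moves.append((f"r2({x},{y})", f"g12({x},{y})"))
        moves.append((f"r2({x},{y})", f"g22({y})"))
        moves.append((f"g12({x},{y})", f"~B({x},{y})"))
        moves.append((f"~B({x},{y})", f"B({x},{y})"))
        if (x, y) in b_facts:                          # conditional edge: only if B(x,y) is in I
            moves.append((f"B({x},{y})", f"r_B({x},{y})"))
    return moves

moves = instantiate(DOMAIN, B, C)
print(("C(a)", "r_C(a)") in moves)            # True: C(a) holds in I
print(any(src == "C(b)" for src, _ in moves)) # False: no edge leaves C(b)
```

The conditional edges of the template either become ordinary edges or disappear, exactly as the text describes for C(a) and C(b).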

Page 36: Works 2015-provenance-mileage

Solve G_Q(I) => Provenance!


Source  [KLZ13]  
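Solving a game in the sense of win(X) :- move(X,Y), ¬win(Y) can be sketched with a simple fixpoint loop. This is not the prototypes' implementation; it is a minimal backward-induction sketch in which a position is lost when every move leads to a won position (in particular when there is no move) and won when some move leads to a lost position. The move graph below is the part of the Q_ABC game reachable from A(a), written by hand under the same assumptions about the template as above (`~` stands for ¬).

```python
# Hand-written fragment of the instantiated Q_ABC game reachable from A(a).
MOVES = {
    "A(a)": ["r2(a,a)", "r2(a,b)"],
    "r2(a,a)": ["g12(a,a)", "g22(a)"],
    "r2(a,b)": ["g12(a,b)", "g22(b)"],
    "g12(a,a)": ["~B(a,a)"], "~B(a,a)": ["B(a,a)"], "B(a,a)": [],
    "g12(a,b)": ["~B(a,b)"], "~B(a,b)": ["B(a,b)"], "B(a,b)": ["r_B(a,b)"],
    "g22(a)": ["C(a)"], "C(a)": ["r_C(a)"],
    "g22(b)": ["C(b)"], "C(b)": [],
}

def solve(moves):
    """Label every position as 'won', 'lost', or 'drawn' for the player to move."""
    nodes = set(moves) | {y for targets in moves.values() for y in targets}
    label = {}
    changed = True
    while changed:
        changed = False
        for x in nodes:
            if x in label:
                continue
            targets = moves.get(x, [])
            if any(label.get(y) == "lost" for y in targets):
                label[x] = "won"
                changed = True
            elif all(label.get(y) == "won" for y in targets):
                label[x] = "lost"   # includes positions without any move
                changed = True
    return {x: label.get(x, "drawn") for x in nodes}

def provenance_edges(moves, label):
    """Keep only won-to-lost and lost-to-won edges; the rest are bad moves."""
    return {(x, y) for x, targets in moves.items() for y in targets
            if {label[x], label[y]} == {"won", "lost"}}

labels = solve(MOVES)
edges = provenance_edges(MOVES, labels)
print(labels["A(a)"])                 # won, i.e. A(a) is true
print(("A(a)", "r2(a,a)") in edges)   # False: choosing y = a is a bad move
```

The move from A(a) to r2(a,b) survives as a provenance edge because B(a,b) holds and C(b) does not, while the move to r2(a,a) is discarded.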

Page 37: Works 2015-provenance-mileage

Provenance ~ Query Evaluation Game

Towards Constraint Provenance Games

Sean Riddle, Sven Köhler, Bertram Ludäscher
Department of Computer Science, University of California, Davis, CA 95616
{swriddle, svkoehler, ludaesch}@ucdavis.edu

Abstract

Provenance for positive queries is well understood and elegantly handled by provenance semirings [GKT07], which subsume many earlier approaches. However, the semiring approach does not extend easily to why-not provenance or, more generally, first-order queries with negation. An alternative approach is to view query evaluation as a game between two players who argue whether, for given database I and query Q, a tuple t is in the answer Q(I) or not. For first-order logic, the resulting provenance games [KLZ13] yield a new provenance model that coincides with provenance semirings (how provenance) on positive queries, but also is applicable to first-order queries with negation, thus providing an elegant, uniform treatment of earlier approaches, including why-not provenance and negation. In order to obtain a finite answer to a why-not question, provenance games employ an active domain semantics and enumerate tuples that contribute to failed derivations, resulting in a domain dependent formalism. In this paper, we propose constraint provenance games as a means to address this issue. The key idea is to represent infinite answers (e.g., to why-not questions) by finite constraints, i.e., equalities and disequalities.

1. Introduction

Consider the relation hop(x, y) in Fig. 1a and query Q_3hop :=

r1: 3hop(X,Y) :- hop(X,Z1), hop(Z1,Z2), hop(Z2,Y).

Q_3hop asks for pairs of nodes that are reachable via exactly three edges (“hops”). If we ask why and how a tuple such as 3hop(a, a) came about, we can use polynomials over a provenance semiring [GKT07, KG12] to get a precise answer, here: p^3 + 2pqr. In Fig. 1a we see that one can “go” from node a to itself in three hops in distinct ways: (i) by using the edge p (= hop(a, a), a self-loop) three times: p·p·p, or p^3 for short, (ii) by using the p edge once, followed by q (= hop(a, b)) and then r (= hop(b, a)), so p·q·r, or (iii) by following q, r, and then p, i.e., q·r·p. Since semiring provenance is commutative, p·q·r + q·r·p = 2pqr as shown in the figure. Many prior provenance approaches can be understood as special provenance semirings: e.g., Trio provenance [BSHW06], why-provenance [BKT01], and lineage [CWW00] all yield coarser versions of the provenance p^3 + 2pqr of 3hop(a, a), i.e., p + 2pqr, p + pqr, and pqr, respectively [KG12].

Provenance through Games. In Fig. 1c we see that 3hop(c, a) is absent, so 3hop(c, a) is false. We cannot use semiring provenance to explain why-not, since the approach is not defined for negative queries and extensions for negation (or set-difference) are not obvious [GP10, GIT11, ADT11a, ADT11b]. On the other hand, if an approach can explain the provenance of ¬A, this naturally provides a why-not explanation for A. In [KLZ13] we proposed an alternative model of provenance that naturally supports negation. Consider the graph in Fig. 1d. It can be understood as the move graph of a query evaluation game in which two players argue whether or not a tuple t ∈ Q(I). If a player wants to prove that t = 3hop(a, a) is in Q_3hop, she needs to move to a ground rule r with t in the head, thereby claiming that this rule instance is deriving t. In Fig. 1d, there are three choices, starting from the root node 3hop(a, a): the move to r1(a, a, a, a), to r1(a, a, a, b), or to r1(a, a, b, a). Here r1(x, y, z1, z2) identifies ground instances of r1. There are two ∀-quantified variables X and Y occurring in the head and body, and two (implicitly) ∃-quantified variables Z1 and Z2, occurring only in r1’s body. By moving to a ground instance of r1 in the game, the player tries to pick values for the ∃-quantified variables that make the rule body true while deriving t in the head. For r1, the middle edge hop(z1, z2) fixes the bindings of Z1 and Z2. For the given database instance I, there are three choices that “work”: (a, a), (a, b), and (b, a). This means that there are exactly three different ways to obtain 3hop(a, a) via r1 over input I: if we choose the p-hop (a, a) as the middle edge, we have p·p·p; for the q-hop (a, b) we have p·q·r; and for the r-hop (b, a) we have q·r·p.¹ The opponent can now challenge each of these claims, by selecting a subgoal …

¹ Game provenance [KLZ13] can distinguish p·q·r and q·r·p and is thus even more fine-grained than the provenance semirings in [GKT07, KG12].

[Figure 1: Each edge hop(x, y) in the input graph I in (a) is annotated (p, q, r, ...) in (b). The answer to Q_3hop is shown in (c) with provenance polynomials [KG12]. The game provenance [KLZ13], e.g., of 3hop(a, a) in (d) corresponds to the semiring provenance polynomial in (c): see (e).]

(b) hop, annotated:
    a  a  p
    a  b  q
    b  a  r
    b  c  s

(c) 3hop with provenance:
    a  a  p^3 + 2pqr
    a  b  p^2q + q^2r
    a  c  pqs
    b  a  p^2r + qr^2
    b  b  pqr
    b  c  qrs

Provenance Game on G_Q(I) = Provenance Polynomials … for positive queries!

Source [KLZ13]

Page 38: Works 2015-provenance-mileage

Provenance ~ Query Evaluation Game

… but also works for Why-Not provenance & non-monotonic queries (i.e., Q can have negation)!!
Here: not 3hop(c,a): can’t go back from GSLIS (c) to CS (a)

[Figure 2: Why-not provenance for 3hop(c, a) using provenance games.]

… g^i_1 in the body of r1, thus claiming that g^i_1 is false and hence that the r1 instance doesn’t derive t. The first player can counter and demonstrate that g^i_1 is true by selecting a rule instance or fact as evidence for g^i_1. The game proceeds in rounds until some player cannot move and thus loses (the opponent wins). In [KLZ13] it was shown how the provenance of a tuple t can be obtained via a regular path query over a solved game graph like the one in Fig. 1d: e.g., p^3 + 2pqr for 3hop(a, a) is represented by a solved game as shown in Fig. 1e: for positive queries, solved games represent semiring provenance by noting that won (green) and lost (red) positions correspond to “+” and “×” operations, respectively (leaves represent input annotations, here: p, q, r, s) [KLZ13].

Why-Not Provenance and the Many Ways to Fail. Since games are inherently symmetric (one player’s win is the opponent’s loss and vice versa), the approach yields an elegant provenance model that unifies why and why-not provenance. Consider the (dark, red) node 3hop(c, a) in Fig. 2. The color coding indicates that the position 3hop(c, a) is lost (the atom is false), i.e., all outgoing moves to a node r1(x, y, z1, z2) lead to a position that is won for the opponent. There are 9 such positions, e.g., r1(c, a, c, b) is one of them (third from the right). Recall that an instance of r1 means that one can do a 3-hop from x to y (here: c to a) via intermediate nodes z1 and z2 (here: c and b). However, in the given database I in Fig. 1(a), there is no hop(c, z), neither for z = b nor for any other z, since there are no outgoing moves from c. In this case, the opponent can successfully attack the goals in the body. Note how the why-not provenance of 3hop(c, a) in Fig. 2 is similar to but different from the why provenance of 3hop(a, a) in Fig. 1: In order to show that 3hop(c, a) is false, one has to show that all possible ways that it could be true are failing, i.e., for all z1, z2, the ground instances r1(c, a, z1, z2) do not derive 3hop(c, a) (since at least one goal in r1’s body is always false). In contrast, to prove that 3hop(a, a) is true, it is sufficient to find some ground instance r1(a, a, z1, z2) whose body is true. Earlier we saw that there are exactly three such instances, corresponding to p·p·p + p·q·r + q·r·p (= p^3 + 2pqr).
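The "all ways must fail" reading of why-not provenance can be checked with a short sketch. It enumerates the ground instances r1(c, a, z1, z2) over the active domain {a, b, c} and lists the goals that fail in each; the function name `why_not_3hop` and the output format are illustrative assumptions, not the game construction itself.

```python
from itertools import product

# Input relation hop from Fig. 1 (annotations omitted) and its active domain.
HOP = {("a", "a"), ("a", "b"), ("b", "a"), ("b", "c")}
DOMAIN = ["a", "b", "c"]

def why_not_3hop(x, y, hop, domain):
    """For each choice of (z1, z2), list the body goals hop(..) that are missing."""
    failures = {}
    for z1, z2 in product(domain, repeat=2):
        goals = [(x, z1), (z1, z2), (z2, y)]
        failures[(z1, z2)] = [goal for goal in goals if goal not in hop]
    return failures

failures = why_not_3hop("c", "a", HOP, DOMAIN)
missing = {goal for failed in failures.values() for goal in failed}

print(len(failures))                                  # 9 ground instances of r1
print(all(failed for failed in failures.values()))    # every instance has a failing goal
print(sorted(missing))                                # the hop tuples whose absence explains why-not
```

Every one of the nine instances fails at least on the first goal hop(c, z1), and the union of failing goals yields the five missing hop tuples mentioned in the text.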

Domain Dependence of Provenance Games. As seen, 3hop(a, a) has three derivations, represented by the first provenance polynomial in Fig. 1(c) and the game provenance in Fig. 1(d) and (e). How many ways are there to show that 3hop(c, a) is false (why-not provenance), or equivalently, that ¬3hop(c, a) is true? If we annotate the leaves of the game graph in Fig. 2 with identifiers u1, . . . , u5 for the five different hop tuples missing in I, we can construct a provenance expression that represents the many ways why 3hop(c, a) is not in the answer. While this answer provides a comprehensive, instance-based why-not explanation, it also exhibits a problem with the current approach: In order to obtain finite (why and why-not) provenance answers for all first-order queries, game provenance employs an active domain semantics: e.g., the provenance game for Q_3hop(I) considers only ground instances of r1 over the active domain adom(I) = {a, b, c}. If additional elements d, e, . . . are added to I (e.g., via a disconnected graph component), the why-not provenance in Fig. 2 becomes incomplete and the provenance has to be recomputed for the larger domain.

Constraint Provenance Games. We propose to solve the problem of domain dependence by modifying provenance games so that they can handle certain infinite relations that can be finitely represented. For example, in addition to the finitely many reasons why 3hop(c, a) fails over the active domain adom(I), there are infinitely many others, if we consider new constants d, e, . . . outside of adom(I). For example, let relation R = {a, b} have two tuples R(a) and R(b). If we want to know why-not R(c), we just point to c ∉ R. But we could also return a more general answer for why-not R(x) and say that ¬R(x) is true for all x with x ≠ a ∧ x ≠ b (not just for x = c). This approach is inspired by Chan’s Constructive Negation [Cha88], a form of constraint logic programming [Stu95]. The key idea is to represent (potentially infinite) relations through constraints, i.e., Boolean combinations of equalities x = c and disequalities x ≠ c.
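The R = {a, b} example can be sketched as a finite constraint that answers why-not R(x) for any constant, including ones outside the active domain. The class name `Disequalities` and the helper `why_not` are hypothetical; the sketch only covers conjunctions of disequalities on a single variable, not the general Boolean combinations mentioned above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Disequalities:
    """A conjunction x ≠ c1 ∧ x ≠ c2 ∧ ... over one variable x."""
    excluded: frozenset

    def holds(self, value):
        # The constraint is satisfied by every value that is not excluded.
        return value not in self.excluded

    def __str__(self):
        return " ∧ ".join(f"x ≠ {c}" for c in sorted(self.excluded))

def why_not(relation):
    """Finite description of all x for which ¬R(x) is true."""
    return Disequalities(frozenset(relation))

R = {"a", "b"}
answer = why_not(R)
print(answer)              # x ≠ a ∧ x ≠ b
print(answer.holds("c"))   # True
print(answer.holds("d"))   # True, although d is not in the active domain
print(answer.holds("a"))   # False
```

The constraint stays the same when new constants are added to the domain, which is the sense in which the representation is domain independent.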

Overview and Contributions. Section 2 briefly explains how first-order queries are translated into games and how provenance is extracted from solved games. In Section 3 we describe the construction of constraint provenance games; additional details and examples are contained in the appendix. Our main contributions are: (i) game provenance provides a uniform treatment of why and why-not provenance for first-order logic (= relational algebra with set-difference); (ii) for positive queries, the approach captures the most informative semiring provenance [GKT07, KG12]; (iii) we develop a constraint provenance framework which yields domain independent provenance expressions, extending prior results [KLZ13]; and (iv) we implemented a prototype of constraint provenance games.

2. Provenance through Games

We first sketch how a query Q over database I gives rise to a game $G_Q(I)$ and how to obtain provenance from the solved game $G^{\gamma}_Q(I)$. Consider, e.g., input relations B(X,Y) and C(Y) and a relational query $Q_{ABC}$ with set-difference: $A := \pi_X(B \bowtie (\pi_Y(B) \setminus C))$. It is well-known that any relational algebra query can be translated into a non-recursive Datalog¬ program. Here, we have $Q_{ABC}$ =

r2 : A(X) :- B(X,Y), ¬C(Y).

The key idea of provenance games is to understand query evaluation as a game between players I and II who argue whether or not a tuple is in the answer. In [KLZ13] we showed that the solved game is a representation of why (why-not) provenance of answer tuples (missing tuples), respectively. Fig. 3a shows the game template for $Q_{ABC}$: to prove that A(x) is true, player I needs to find a rule instance of r2, say A(x) :- B(x, y), ¬C(y), which derives the desired tuple A(x) and whose choice y for the ∃-quantified variable Y in the body satisfies all literals (subgoals) in the rule body. In the game template in Fig. 3a this corresponds to a move from A(X) to r2(X,Y) while choosing a suitable domain value y for the ∃-quantified variable Y. Player II can challenge this claim by “attacking” one of the subgoals g in the rule body. If player I chose the “wrong” y for the instance r2(x, y), then II can always attack at least one subgoal that falsifies the body. The game continues in turns, until a player cannot move and loses, and the opponent wins.
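To make the argument between the two players concrete, here is a small Python sketch of the rule r2 : A(X) :- B(X,Y), ¬C(Y). It does not build the game graph from [KLZ13]; it only lists, for a derived A(x), the choices of y with which player I wins, and for a missing A(x), the subgoal player II would attack for each y in the active domain. The instance (B, C) and the function names are invented for illustration.

```python
def derive(B, C):
    """Evaluate A(X) :- B(X, Y), not C(Y); keep the witnesses y for each x."""
    why = {}
    for x, y in sorted(B):
        if y not in C:
            why.setdefault(x, []).append(y)  # a winning choice of y for player I
    return why


def why_not(x, B, C):
    """For a missing A(x): per candidate y, the subgoal that player II attacks."""
    # Active domain semantics: only constants that occur in the instance.
    domain = sorted({v for pair in B for v in pair} | set(C))
    reasons = {}
    for y in domain:
        if (x, y) not in B:
            reasons[y] = f"B({x},{y}) is false"
        elif y in C:
            reasons[y] = f"C({y}) is true"
    return reasons


B = {("a", "b"), ("b", "c")}  # hypothetical instance
C = {"c"}

print(derive(B, C))        # {'a': ['b']}
print(why_not("b", B, C))  # {'a': 'B(b,a) is false', 'b': 'B(b,b) is false', 'c': 'C(c) is true'}
```

Because `why_not` enumerates the active domain, its answer would have to be recomputed if new constants were added, which is the domain dependence discussed above.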

A game template $G_Q$ for query Q contains literal nodes (oval; for atoms or their negation), rule nodes (boxes; for Datalog¬ rules), and goal nodes (rounded boxes; subgoals of rules): see Fig. 3a. Edge labels indicate a condition for a move: e.g., the label “∃Y” between a literal node, say A(X), and a rule node, say r2(X,Y), requires a player to pick a value y for the ∃-quantified variable Y when moving from an atom to the rule that derives it. Similarly, a condition “X:=Y” means that the current choice of Y becomes

Source  [KLZ13]  

Page 39: Works 2015-provenance-mileage

Database Provenance: Summary
•  Fine-grained “white-box” provenance
•  Solved (pretty much) for positive queries
•  … not so much for negation and “Why-Not”
   –  Active area of research!
•  Some research prototypes …
•  … and some real-world implementations!
•  Note: Those in need of provenance often already “do it”!
   –  Crash recovery, auditing, concurrency control, …

Page 40: Works 2015-provenance-mileage

Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
   –  Why-, How-, …, Why-Not Provenance
•  Provenance in Scientific Workflows
•  YesWorkflow: Doing more (sometimes with less)

Page 41: Works 2015-provenance-mileage

Scientific Workflows: ASAP!
•  Automation
   –  wfs to automate computational aspects of science
•  Scaling (exploit and optimize machine cycles)
   –  wfs should make use of parallel compute resources
   –  wfs should be able to handle large data
•  Abstraction, Evolution, Reuse (human cycles)
   –  wfs should be easy to (re-)use, evolve, share
•  Provenance
   –  wfs should capture processing history, data lineage
   →  traceable data- and wf-evolution
   →  Reproducible Science

[Figure: screenshots of workflow systems, including Trident Workbench and VisTrails]

Once upon a time …

Page 42: Works 2015-provenance-mileage

Phylogenetics workflow in Kepler (2005)

Graphical interface
§  Canvas for assembling and displaying the workflow.

§  Library of workflow blocks (‘actors’) that can be dragged onto the canvas and connected.

§  Arrows that represent control dependencies or paths of data flow.

§  A run button.

These features are not essential to managing actual scientific workflows.

What some of us think of when we hear the term ‘scientific workflows’

Source: Tim McPhillips

Page 43: Works 2015-provenance-mileage

10 Key Functions of a Sci-WFS

1.  Automate programs and services scientists already use.

2.  Schedule invocations of programs and services correctly and efficiently – in parallel where possible.

3.  Manage dataflow to, from, and between programs and services.

4.  Enable scientists (not just developers) to author or modify workflows easily.

5.  Predict what a workflow will do when executed: prospective provenance.

6.  Record what actually happens during workflow execution.

7.  Reveal retrospective provenance – how workflow products were derived from inputs via programs and services.

8.  Organize intermediate and final data products as desired by users.

9.  Enable scientists to version, share and publish their workflows.

10.  Empower scientists who wish to automate additional programs and services themselves.

These functions, not actors, distinguish scientific workflow automation from general scientific software development.

Tim McPhillips et al.

Page 44: Works 2015-provenance-mileage

Yes, scripts are (can be) workflows too!

Interactive Visualization


Page 45: Works 2015-provenance-mileage

SKOPE:  Synthesized  Knowledge  Of  Past  Environments  

Bocinsky, Kohler et al. study rain-fed maize of the Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde migrations; late 13th century AD. Uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. The algorithm estimates joint information in tree rings and a climate signal to identify the “best” tree-ring chronologies for reconstructing the climate.

K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618

… implemented as an R Script …

Page 46: Works 2015-provenance-mileage

…  HPCBio  Workflows  @  Illinois  

National Petascale Computing Facility

Broad Institute: Recommended workflow for variant analysis

Liudmila  Mainzer,  Victor  Jongeneel  HPC  Bio  @  Illinois  

Quickly,  say:    #!/bin/bash  

Page 47: Works 2015-provenance-mileage

It’s time to shift control …
•  … back from being consumers of someone else’s (= our) tools …
   –  “Just click here!”
•  … to tool makers!
   –  Scientists who author workflows as scripts!
•  Go where the wild things (users!) are …
   –  Yes, develop for “end users” …
   –  … but don’t forget the tool makers!
•  Can we do this together?

Page 48: Works 2015-provenance-mileage

Example: AutoDrug Workflow

[Figure: AutoDrug workflow diagram.
Stages: Collect Data, Process Data, Solve Structure, Analyze Density.
Steps: Mount Sample, Screen Sample, Align Sample, Expose Sample, Analyze Images, Check Criteria, Calculate Strategy, Collect Data Set, Calculate Maps, List Peaks, Run Search, Refine Structure, Integrate Images, Scale Reflections, Merge Reflections, Calc Amplitudes.
Tools: Blu-Ice, LABELIT, molrep, refmac, ipmosflm, xds, pointless, scala, xtriage, truncate, rfree.]

Tsai, Y., McPhillips, S. E., González, A., McPhillips, T. M., Zinn, D., Cohen, A. E., ... & Soltis, S. (2013). AutoDrug: fully automated macromolecular crystallography workflows for fragment-based drug discovery. Acta Crystallographica Section D: Biological Crystallography, 69(5), 796-803.

Page 49: Works 2015-provenance-mileage

3D Protein Structure Determination by X-ray Crystallography

[Figure panels: diffraction images; experimental electron density and protein model; full protein structure]

Source: Tim McPhillips

Page 50: Works 2015-provenance-mileage

Automated Sample Handling

[Figure labels: crystal in loop; sample mounting robot; cassette shipping dewar; crystal mounting pin; sample cassette]

Alice, the high-throughput crystallographer: When the first shift of her beam time begins, technicians at the beam line load the three cassettes into a liquid nitrogen dewar within reach of the sample-mounting robot and close the radiation door. From this point Alice is able to control beam line operations remotely.

Source: Tim McPhillips

Page 51: Works 2015-provenance-mileage

Remote beam line operation

Source: Tim McPhillips

Page 52: Works 2015-provenance-mileage

Outline
•  All things “Provenance” …
•  Provenance: Why should you care?
•  Provenance in Databases
   –  Why-, How-, …, Why-Not Provenance
•  Provenance in Scientific Workflows
•  YesWorkflow: Doing more with Provenance!
   –  … sometimes using less (e.g., no provenance recorder)

Page 53: Works 2015-provenance-mileage

YesWorkflow: Yes, scripts are workflows, too!

[Figure: YesWorkflow graph of the paleoclimate reconstruction script.
Steps: GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output.
Data: master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years, PRISM_annual_growing_season_precipitation, dendro_series_for_calibration, dendro_series_for_reconstruction, cellwise_unique_selected_linear_models, cellwise_union_selected_linear_models, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif.]

•  Script vs Workflows/ASAP:
   –  Automation: *****
   –  Scaling: **
   –  Abstraction: *
   –  Provenance: **

Page 54: Works 2015-provenance-mileage

Enter: YesWorkflow! (yesworkflow.org)

•  YesWorkflow (YW)
   –  Grass-roots effort
   –  … meeting the scientists/users where they R!
      •  R, Matlab, (i)Python, Jupyter, …
   –  Scripts + simple user annotations
•  => Reveal the workflow model/abstraction … that underlies the (script) implementation
•  => YW can give us more of ASAP!
   –  First YW: ASAP (Abstraction) ...
   –  Then YW-recon: ASAP (reconstructing runtime Provenance)

Page 55: Works 2015-provenance-mileage

Related Work, other Approaches
… to bring workflow/provenance benefits to scripts:
•  Runtime Provenance Recorders:
   –  use (R, Python, ..) libraries and/or code instrumentation to capture runtime observables
      •  file read/write, function calls, program variables & state, …
   –  noWorkflow system
      •  [Murta-Braganholo-Chirigati-Koop-Freire-IPAW14]
      •  exploits Python profiling library to capture runtime provenance

=> helps with "S" and "P"
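As a rough illustration of what "capturing runtime observables" can mean, the sketch below uses Python's standard `sys.setprofile` hook to record which function called which. It is not how noWorkflow is implemented; the functions `load` and `analyze` and the `calls` list are placeholders invented for this example.

```python
import sys

calls = []  # recorded (caller, callee) pairs


def recorder(frame, event, arg):
    # "call" events fire when a Python-level function is entered;
    # calls into C builtins arrive as "c_call" and are ignored here.
    if event == "call":
        caller = frame.f_back.f_code.co_name if frame.f_back else None
        calls.append((caller, frame.f_code.co_name))


def load(path):
    # Stand-in for reading a data file; the path is not actually opened.
    return [1, 2, 3]


def analyze(path):
    return sum(load(path))


sys.setprofile(recorder)
try:
    result = analyze("data.csv")
finally:
    sys.setprofile(None)  # always switch the recorder off again

print(result)
print(calls)
```

A recorder like this sees everything that happens at runtime, but it knows nothing about which of those calls matter to the scientist; that is the gap the user-supplied workflow model is meant to fill.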


Page 56: Works 2015-provenance-mileage

YW (prospective) and YW-Recon (retrospective) Provenance
•  1. YW: Annotate Script => YW Model
   –  Annotate @BEGIN..@END, @IN, @OUT
   –  Visualize, share, be happy :-)
•  2. Run script
   –  Files are read and written
   –  Folder- & Filenames have metadata
•  3. YW-Recon
   –  Use @URI tags that link YW Model <=> Persisted Data
   –  Run URI-template queries
      •  cf. “ls -R” & RegEx matching
•  4. YW-Query
   –  Answer the user’s provenance queries

Page 57: Works 2015-provenance-mileage

YW annotations: Model your Workflow!
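A minimal sketch of what such annotations might look like in a Python script. The tag names follow the slides (@BEGIN, @END, @IN, @OUT, @URI) and the two URI templates are the ones shown in the workflow graph; the function `corrected_image_path` and the zero-padded frame number are assumptions made for this example, and the exact comment syntax should be checked against yesworkflow.org.

```python
# @BEGIN transform_images
# @IN raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
# @OUT corrected_image @URI file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img

def corrected_image_path(sample_id, energy, frame_number):
    """Build the output path so that it carries the metadata named in the @URI template."""
    return f"data/{sample_id}/{sample_id}_{energy}eV_{frame_number:03d}.img"

# @END transform_images

print(corrected_image_path("DRT322", 11000, 30))  # data/DRT322/DRT322_11000eV_030.img
```

The annotations live in ordinary comments, so the script runs exactly as before; only a tool that looks for the tags sees the workflow model.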


Page 58: Works 2015-provenance-mileage

YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!

•  YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
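How a workflow view can be recovered from comments alone is sketched below. This is not the YesWorkflow implementation, only a toy extractor under the assumption that each tag sits in a `#` comment; the function `extract_model` is invented here, and the wiring of the two blocks in the sample script is guessed from the node names on the slide.

```python
import re

TAG = re.compile(r"@(BEGIN|END|IN|OUT)\s+(\S+)")


def extract_model(script_text):
    """Collect program blocks with their inputs and outputs from comment lines."""
    blocks, current = {}, None
    for line in script_text.splitlines():
        if not line.lstrip().startswith("#"):
            continue  # ordinary code is ignored; only comments carry the model
        for tag, name in TAG.findall(line):
            if tag == "BEGIN":
                current = blocks.setdefault(name, {"in": [], "out": []})
            elif tag == "END":
                current = None
            elif current is not None:
                current["in" if tag == "IN" else "out"].append(name)
    return blocks


script = """
# @BEGIN load_screening_results
# @IN sample_spreadsheet
# @OUT sample_name
# @OUT sample_quality
names = ["DRT240", "DRT322"]
# @END load_screening_results
# @BEGIN calculate_strategy
# @IN sample_name
# @IN sample_quality
# @OUT accepted_sample
# @OUT rejected_sample
# @END calculate_strategy
"""

model = extract_model(script)
# Data-flow edges: an output of one block that is an input of another.
edges = [(a, name, b) for a in model for b in model
         for name in model[a]["out"] if name in model[b]["in"]]
print(edges)
```

The resulting `edges` list is all a graph renderer needs to draw steps connected through named data items.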

[Figure: workflow graph that YW generates from the annotated data collection script.
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity.
Data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path.
Data with URI templates:
   sample_spreadsheet   file:cassette_{cassette_id}_spreadsheet.csv
   calibration_image    file:calibration.img
   run_log              file:run/run_log.txt
   rejection_log        file:/run/rejected_samples.txt
   raw_image            file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
   corrected_image      file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
   collection_log       file:run/collected_images.csv]

Page 59: Works 2015-provenance-mileage

Voila! The Workflow revealed!

[Figure: YW workflow graph of the data collection script, repeated from the previous slide]

Page 60: Works 2015-provenance-mileage

Get  3  views  for  the  price  of  1!  

[Figure panels: Process view, Data view, Combined view]

Page 61: Works 2015-provenance-mileage

Paleoclimate Reconstruction (EnviRecon.org)

•  … explained using YesWorkflow!

[Figure: YW graph of the paleoclimate reconstruction script, repeated from the earlier slide “YesWorkflow: Yes, scripts are workflows, too!”]

Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."

Page 62: Works 2015-provenance-mileage

Provenance Lands

Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)

Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)

Page 63: Works 2015-provenance-mileage

YW-RECON: Prospective & Retrospective Provenance … (almost) for free!

[Figure: YW workflow graph with URI templates, repeated from the earlier slides, next to the files left behind by a run:]

run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
...     ...     ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
...     ...     ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
...         ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
...             ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
...     ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
...     ...
│       └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt

•  URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
•  … facilitating provenance reconstruction
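The link between a URI template and the files of a run can be sketched as a template-to-regex translation. The two templates are taken from the workflow graph; the helper names `template_to_regex` and `match_uri` are invented for this illustration and are not part of YW-Recon.

```python
import re


def template_to_regex(template):
    """Turn 'a/{x}/b_{y}.raw' into a regex with one named group per {variable}."""
    parts = re.split(r"\{(\w+)\}", template)
    seen, pattern = set(), ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)          # literal text between variables
        elif part in seen:
            pattern += f"(?P={part})"           # repeated variable: same value again
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+?)"    # a variable never spans a "/"
    return re.compile(pattern)


def match_uri(template, path):
    """Return the variable bindings if the path fits the template, else None."""
    m = template_to_regex(template).fullmatch(path)
    return m.groupdict() if m else None


RAW = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"

print(match_uri(RAW, "run/raw/q55/DRT322/e11000/image_030.raw"))
print(match_uri(CORRECTED, "data/DRT322/DRT322_11000eV_030.img"))
print(match_uri(RAW, "run/run_log.txt"))
```

Each successful match yields the metadata that the script author encoded in folder and file names, which is the raw material for reconstructed retrospective provenance.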


Page 65: Works 2015-provenance-mileage

Data collection workflow (X-ray diffraction)

[Figure: YW workflow graph of the data collection script, repeated from the earlier slides]

Page 66: Works 2015-provenance-mileage

Data collection workflow: runtime data

[Figure: YW workflow graph next to the run/ directory tree, both repeated from the earlier slides]

1.  YW annotations => YW model
2.  Files & folders left by a run => runtime (meta-)data

Page 67: Works 2015-provenance-mileage

[Figure: YW workflow graph and run/ directory tree, repeated from the earlier slides]

Q1: What samples did the script run collect images from?
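A question like Q1 can be answered from file names alone. The sketch below assumes the raw-image layout from the URI template run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw; the short `paths` list stands in for a real directory listing and is made up from the tree shown on the slides.

```python
import re

RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw"
)

# Stand-in for the result of walking the run/ directory.
paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/run_log.txt",
]

# Every path that fits the template contributes its sample_id.
samples = sorted({m["sample_id"] for p in paths if (m := RAW.fullmatch(p))})
print(samples)  # ['DRT240', 'DRT322']
```

Filtering on one `sample_id` and collecting `energy` instead would answer per-sample questions in the same way.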

Page 68: Works 2015-provenance-mileage

[Figure: YW workflow graph and run/ directory tree, repeated from the earlier slides]

Q2: What energies were used for image collection from sample DRT322?

Page 69: Works 2015-provenance-mileage

[Figure: YW workflow graph and run/ directory tree, repeated from the earlier slides]

Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
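Q3 asks for a lineage step in the reverse direction. One way to sketch it: parse the corrected image's name, fill the variables into the raw-image template, leave the unknown cassette id as a wildcard, and look for files that fit. The helper `raw_images_for` and the `paths` list are invented for this example; only the two naming patterns come from the slides.

```python
import fnmatch
import re

CORRECTED = re.compile(
    r"(?P<sample_id>[^/_]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img"
)
RAW_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"


def raw_images_for(corrected_name, existing_paths):
    """Find raw images whose path metadata agrees with a corrected image name."""
    m = CORRECTED.fullmatch(corrected_name)
    if m is None:
        return []
    # The corrected name does not contain the cassette id, so match any value.
    pattern = RAW_TEMPLATE.format(cassette_id="*", **m.groupdict())
    return [p for p in existing_paths if fnmatch.fnmatchcase(p, pattern)]


# Stand-in for the result of walking the run/ directory.
paths = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

print(raw_images_for("DRT322_11000eV_030.img", paths))
# ['run/raw/q55/DRT322/e11000/image_030.raw']
```

The cassette id of the sample can then be read off the matching path, here `q55`.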

Page 70: Works 2015-provenance-mileage

[Figure: YW workflow graph and run/ directory tree, repeated from the earlier slides]

Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?

Page 71: Works 2015-provenance-mileage

Querying  Provenance  


Page 72: Works 2015-provenance-mileage

Taking YW for a spin …
•  “To document on-the-fly, specifically for a given workflow configuration invoked:
   –  do not insert annotations into code,
   –  but rather have code print annotations into a special log during execution,
   –  then parse that log!”
   –  Liudmila Mainzer

Source:  L  Mainzer,  V  Jongeneel  (IGB  &  NCSA)    
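The suggestion in the quote, printing annotations into a log at runtime and parsing the log afterwards, might look roughly like this. Everything here is hypothetical: the `annotate` helper, the log format, the `align` step, and the file names are invented for illustration and do not describe the HPCBio workflows.

```python
import io
import re

log = io.StringIO()  # stands in for the special log file


def annotate(tag, name, value=None):
    """Print one annotation line into the log while the script runs."""
    suffix = f" {value}" if value is not None else ""
    log.write(f"@{tag} {name}{suffix}\n")


def align(sample, reference):
    annotate("BEGIN", "align")
    annotate("IN", "sample", sample)
    annotate("IN", "reference", reference)
    output = f"{sample}.bam"  # placeholder for the real work
    annotate("OUT", "alignment", output)
    annotate("END", "align")
    return output


align("s1.fastq", "hg19.fa")

# Afterwards: parse the log into (tag, name, value) records.
LINE = re.compile(r"@(\w+) (\S+)(?: (.*))?")
records = [LINE.fullmatch(line).groups() for line in log.getvalue().splitlines()]
for record in records:
    print(record)
```

Unlike annotations in comments, the logged lines carry the actual values of one particular run, so the parsed log describes the configuration that was really invoked.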

Page 73: Works 2015-provenance-mileage

Conclusions
•  Provenance
   –  … in databases
   –  … in scientific workflows
•  Scripts are (often) workflows too!
•  → Need to support provenance management for scripts and scientific workflows!
•  One size might not fit all …
   –  Use prospective, retrospective (recorded, reconstructed provenance)
•  Facilitate “insider” (or “deep”) provenance
   –  … the stuff scientists need to get their job done!

Page 74: Works 2015-provenance-mileage

Deep Provenance to get the science done!

•  When reconstructing the past climate, need to know which tree-ring source was used!

[Figure: map of the US Southwest (Arizona, Colorado, New Mexico, Utah) with site labels CRTZ, MVNP, ESPN, LANL; legend: Douglas fir; Pinyon and juniper; Spruce, pine, and true fir; GHCN stations]

K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618

Page 75: Works 2015-provenance-mileage

Conclusions (Cont’d)
•  YesWorkflow: Go where the users are!
   –  … they already capture provenance through metadata!
•  Beware your level of provenance abstraction
   –  Let the user provide a workflow model easily!
•  YW-Recon:
   –  … finishing support for retrospective provenance without using a runtime provenance recorder!
   –  Key insight: scientists already leave provenance “bread crumbs” behind! (it’s not an accident!)
•  Future Work:
   –  Build systems that work with the existing workflow of scientists!
   –  There are many research questions & opportunities out there!
      •  e.g.: Why-Not provenance for scientific workflows anyone?

Page 76: Works 2015-provenance-mileage

References    …    


Page 77: Works 2015-provenance-mileage

References  (cont’d)  
