A Brief Provenance Tour … via DataONE
Bertram Ludäscher & Dave Vieglais
DataONE Users Group meeting
2017-07-24, Bloomington, IN, Indiana Memorial Union


A Tour of Provenance: Overview

§ All science is physics or stamp collecting provenance
§ What is provenance ...
§ ... by whom, for whom, for what
§ ... how-to (in DataONE)?
§ Prospective provenance (≈ scientific workflows)
§ Retrospective provenance (≈ runtime events, traces)
§ Hybrid provenance
§ Provenance in DataONE
§ A closer look at provenance


The Price is Right! Right?

• One of these has been sold for nearly $180 million.
• The other could be worth as much or more.
• Which is which?
• What is the difference?

https://en.wikipedia.org/wiki/Les_Femmes_d%27Alger#.22Version_O.22
https://en.wikipedia.org/wiki/La_Bella_Principessa

Provenance@DUG-2017

Provenance defined …

• Oxford English Dictionary
  – The place of origin or earliest known history of something:
    • an orange rug of Iranian provenance
  – The beginning of something's existence; its origin:
    • they try to understand the whole universe, its provenance and fate
  – A record of ownership of a work of art or an antique, used as a guide to authenticity or quality:
    • the manuscript has a distinguished provenance
• What is the provenance (origin) of "provenance"?

Computational Provenance defined …

• Provenance of what? By whom? For what?
• Origin and processing history of digital artifacts …
  – usually: data (products), figures, ...
  – sometimes: workflow (and script) evolution …
  – … by researchers, (computational-, data-) scientists
  – ... for data (re-)use
  – … for transparency, reproducibility
  – ... for others? ... self?
• Different sub-communities study provenance:
  – Provenance in (scientific) workflows …
  – Provenance in databases …
  – Wait, there is more:
    • ... programming languages, systems/security, …
    • … information science, archival science, diplomatics
  – Last but not least: provenance in the science sciences!
    • a.k.a. natural sciences

Provenance in the Natural Sciences ..

• Can you "see provenance" in this image?
• Grand Canyon's rock layers are a record of the early geologic history of North America. The ancestral Puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0)
• Not shown: computational archaeologists reconstructing past climate from multiple tree-ring databases
  => computational provenance is key for transparency & reproducibility

… in Biology & Natural History
Provenance = Understanding what happened …

Natura non facit saltus (nature makes no leaps)

Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.
Author: Jkwchui (Based on drawing by Truth-seeker2004)

Provenance-in-Science Palooza

• What are those?
• Cosmology
• Geology, Stratigraphy
• Phylogeny
  – the Tree of Life
• Genealogy
  – your family: literally
• Academic Pedigree
  – "Doktorvater" (Doktormutter)
• Etymology
• Chain of custody
  – of art(ifacts)
• From provenance to explanations and understanding

Using Provenance for Transparency, Reproducibility

• What input data went into this study?
• What methods were used?
• … with what parameter settings, calibrations, …?
• Can we trust the data and methods?

§ Provenance (lineage): track origin and processing history of data
§ => query provenance to understand, exploit (data, code) dependencies
§ => attribution, credit, data quality assessment, trust via audit trails, provenance
§ => discovery of data, methodologies, experiments

https://en.wikipedia.org/wiki/Hockey_stick_graph

Climate Change: Whodunnit?

Tracking sources (data, code) … the hard way

Provenance today: Importance: ✓ How-to: ??

=> projects, groups conduct R&D on provenance methods, tools, …

In particular:

"This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country."

From Provenance to Reproducible Science …

Capturing provenance is crucial for transparency, interpretation, debugging, …
=> repeatable experiments
=> reproducible science
=> need workflow-system agnostic model

... via scientific workflows (… and scripts)

Tour Stop: Scientific Workflows: ASAP

• Automation
  – wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
  – wfs should make use of parallel compute resources
  – wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
  – wfs should be easy to (re-)use, evolve, share
• Provenance
  – wfs should capture processing history, data lineage
    => traceable data- and wf-evolution
    => Reproducible Science

(Screenshots: Trident Workbench, VisTrails)

Once upon a time …

Executable WATERS Workflow in Kepler

Data Curation Workflows (Filtered-Push … Kepler … Kurator projects)

Workflows <=> Provenance: an important link!

• Runtime Provenance (a.k.a. traces, logs, retrospective provenance, "Trace-land")
• Workflow Modeling & Design (a.k.a. prospective provenance, "Workflow-land")

ProvONE: PROV for scientific workflows
(Transfer station to any of several other "standard extensions")

• "Trace-Land" (retrospective provenance)
• "Data-Land"
• "Workflow-Land" (prospective prov.)

Also: OPM-W (G&G et al), others …

Yang Cao (1), Christopher Jones (2), Víctor Cuevas-Vicenttín (3), Matthew B. Jones (2), Bertram Ludäscher (1), Timothy McPhillips (1), Paolo Missier (4), Christopher Schwalm (5), Peter Slaughter (2), Dave Vieglais (6), Lauren Walker (2), Yaxing Wei (7)

(1) University of Illinois, Urbana-Champaign
(2) National Center for Ecological Analysis and Synthesis, UCSB
(3) Universidad Popular Autónoma del Estado de Puebla, Mexico
(4) School of Computing Science, Newcastle University, UK
(5) Woods Hole Research Center, Falmouth, MA
(6) University of Kansas, Lawrence
(7) Environmental Sciences Division, Oak Ridge National Lab, TN

Provenance Standards vs Tools

• Do we need more standards to sort this out?
• How should we think about provenance?
  – ... in workflows, scripts, databases?
• What can we do with provenance?
  – ... in workflows and databases?
• Tools to create, share, use provenance
  – … not just for "provenance for others"
  – ... need more "provenance for self"
• => creating, using (querying!) provenance
• … in DataONE!
• Modeling scripts as workflows & linking provenance => YesWorkflow toolkit (later)

DataONE Summer 2017 Internship

• YesWorkflow (YW) model linked to ProvONE (W3C PROV extension)
• This summer: YW model in RDF; YW model queries in SPARQL
• DataONE summer interns 2017: Linh Hoang, Hui Lyu (UIUC)
• https://github.com/idaks/DataONE-Prov-Summer-2017


Provenance in DataONE

§ Phase II Goal: Facilitate reproducible science
  § Track data derivation history
  § Track data inputs and outputs of analyses
  § Track analysis and model executions
  § Preserve and document software workflows
  § Link all of these to publications

Provenance at a Glance

§ ProvONE extensions to W3C PROV
§ New first-class objects in DataONE
  § Figures, Software
§ Extended Data Package with Provenance
§ New Tools
  § Provenance indexing
  § Web search and browse interface for provenance
  § Matlab tool for generating provenance
  § R tool for generating provenance
  § YesWorkflow tool for modeling provenance from scripts

ProvONE extends PROV for science

• "Trace-Land" (retrospective provenance)
• "Data-Land"
• "Workflow-Land" (prospective prov.)

Provenance in DataONE: Storing Metadata

§ Dataset model in DataONE
§ Where provenance information is stored

Modeling Provenance Relationships

§ prov:used
§ prov:generated
§ prov:derivedFrom

(Figure: Data Package 1 as an OAI-ORE resource map with a ProvONE trace. Nodes: metadata, science data, figures, software, and data granule 1 (doi:10.5063/F1Z899CZ). Edge labels: cito:documents, prov:used, prov:generated, prov:derivedFrom.)
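The three relationships can be read as edges in a small graph that is then queried for lineage. The following is a minimal sketch in Python, not DataONE's implementation: the identifiers (run:analysis1, fig:map.png, and so on) are invented for illustration, and only the predicate names come from the slide.

```python
from collections import defaultdict

# Hypothetical provenance statements as (subject, predicate, object) triples.
# The predicates are the ones named on the slide; the identifiers are made up.
TRIPLES = [
    ("run:analysis1", "prov:used", "data:granule1.csv"),
    ("run:analysis1", "prov:used", "code:analysis.R"),
    ("run:analysis1", "prov:generated", "fig:map.png"),
    ("fig:map.png", "prov:derivedFrom", "data:granule1.csv"),
    ("run:analysis2", "prov:used", "fig:map.png"),
    ("run:analysis2", "prov:generated", "fig:summary.pdf"),
]


def direct_sources(triples):
    """Map each object to the objects it directly depends on."""
    used = defaultdict(set)
    generated = defaultdict(set)
    sources = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "prov:used":
            used[subject].add(obj)
        elif predicate == "prov:generated":
            generated[subject].add(obj)
        elif predicate == "prov:derivedFrom":
            sources[subject].add(obj)  # explicit derivation edge
    # Everything a run generated depends on everything that run used.
    for run, outputs in generated.items():
        for out in outputs:
            sources[out] |= used[run]
    return sources


def lineage(item, triples):
    """Return everything `item` transitively depends on."""
    sources = direct_sources(triples)
    seen, stack = set(), [item]
    while stack:
        for src in sources.get(stack.pop(), ()):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


if __name__ == "__main__":
    print(sorted(lineage("fig:summary.pdf", TRIPLES)))
```

Tracing fig:summary.pdf back through two runs is the kind of query that lets a result be followed all the way to its original inputs.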

Provenance … of Data

bit.ly/DWS_01_04

Provenance Display: DataONE Search UI

(Screenshots of the provenance display in the DataONE search UI.)

https://search.dataone.org/#view/urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171

Provenance in DataONE: Creating Provenance Metadata

§ How to generate provenance
  § Retrospective
  § Prospective

Investigator Tools

• Recordr R Library
• Matlab DataONE Toolbox
• YesWorkflow Tool (YW)

Example: R-programming

    # Generate map of locations by type
    library(recordr)
    # Create a Recordr session object
    recordr <- new("Recordr")
    # Run the script and record its provenance under the given tag
    pkg <- record(recordr, "./hcdbSites.R", "loc-by-type-png")

YesWorkflow: Modeling Scripts as Workflows

    # @begin CreateGulfOfAlaskaMaps
    # @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
    # @in world @as RWorldMap
    # @out map @as Map_Of_Sampling_Locations.png
    # @out detailMap @as Detailed_Map_Of_SamplingLocations.png
    ... mapping code is here ...
    # @end CreateGulfOfAlaskaMaps
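To illustrate how annotations like these can be turned into a workflow model, here is a minimal sketch in Python. It is not the YesWorkflow tool itself: the function name parse_annotations and the dictionary layout are assumptions made for this example, and only the four keywords shown on the slide (@begin, @end, @in, @out, plus @as) are handled.

```python
import re

# Matches one "@keyword name [@as alias]" annotation inside a comment line.
ANNOTATION = re.compile(
    r"@(begin|end|in|out)\s+(\S+)(?:\s+@as\s+(\S+))?", re.IGNORECASE
)


def parse_annotations(script_text):
    """Collect the blocks declared by @begin/@end with their @in/@out ports."""
    blocks, stack = [], []
    for line in script_text.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue  # ordinary code or comment line
        keyword = match.group(1).lower()
        name, alias = match.group(2), match.group(3)
        if keyword == "begin":
            stack.append({"name": name, "in": [], "out": []})
        elif keyword == "end":
            if not stack or stack[-1]["name"] != name:
                raise ValueError(f"unmatched @end {name}")
            blocks.append(stack.pop())
        elif stack:
            # Record the port under its alias when one is given.
            stack[-1][keyword].append(alias or name)
    return blocks


SCRIPT = """\
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
# ... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
"""

if __name__ == "__main__":
    for block in parse_annotations(SCRIPT):
        print(block["name"], "reads", block["in"], "writes", block["out"])
```

Because the annotations live in comments, the same approach works for R, Python, or Matlab scripts without executing them.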

Transitive Credit

When a user cites a pub, we know:
• Which data produced it
• What software produced it
• What was derived from it
• Who to credit down the attribution stack

• Missier, Paolo. 2016. "Data trajectories: tracking reuse of published data for transitive credit attribution." International Journal of Digital Curation 11, no. 1, 1-16.
• Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117
• ...
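One simple way to picture "credit down the attribution stack" is to let each product split its credit among people and the products it builds on, then multiply the shares along every path. The sketch below in Python is an illustration under that assumption, not the scheme from the cited papers; all names and weights are invented.

```python
# Hypothetical credit maps: each product splits a total credit of 1.0 among
# people and the products it builds on. All names and weights are invented.
CREDIT = {
    "paper": {"alice": 0.6, "figure": 0.4},
    "figure": {"bob": 0.5, "dataset": 0.25, "script": 0.25},
    "dataset": {"carol": 1.0},
    "script": {"bob": 1.0},
}


def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Push `weight` down the attribution stack and sum it per contributor.

    Assumes the dependency graph is acyclic.
    """
    totals = {} if totals is None else totals
    for name, share in credit_maps[product].items():
        if name in credit_maps:
            # `name` is itself a product: pass its share on to its contributors.
            transitive_credit(name, credit_maps, weight * share, totals)
        else:
            totals[name] = totals.get(name, 0.0) + weight * share
    return totals


if __name__ == "__main__":
    for name, share in sorted(transitive_credit("paper", CREDIT).items()):
        print(f"{name}: {share:.2f}")
```

The provenance graph supplies the "builds on" edges; the weights would have to come from the authors or from some agreed convention.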

Provenance in DataONE search ..

search.dataone.org

Provenance in Action: Benefits & Impact

A DataONE search (here: "grass") yields different packages with provenance

DataONE: Support for Provenance

• Yaxing's script with inputs & output products
• Christopher's YesWorkflow model
• Christopher using Yaxing's outputs as inputs for his script
• Christopher's results can be traced back all the way to Yaxing's input

Exploring Provenance in DataONE

• Let's go there => Mark Carls. 2017. Analysis of hydrocarbons following the Exxon Valdez oil spill, Gulf of Alaska, 1989 - 2014. Gulf of Alaska Data Portal. urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171.


From Workflows & Provenance to Provenance for Script-based Workflows …

• What workflow tools are (most) scientists using?
  – Workflow systems
  – … vs scripts (Python, R, MATLAB, ...)
• What provenance tools are there?
  – Workflow system support
  – Tools for "workflow" scripts!?

SKOPE: Synthesized Knowledge Of Past Environments

Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify "best" tree-ring chronologies for climate reconstruction.

K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618

… implemented as an R Script …

Provenance Support for Reproducible Science
Example: Paleoclimate Reconstruction

Science paper (OA) uses:
• open source code:
  – R, PaleoCAR, …
• Is that all we need?
• What was the "workflow"?
• Is there prospective and/or retrospective provenance?

YesWorkflow: Yes, scripts are workflows, too!

• Script vs Workflows / ASAP:
  – Automation: *****
  – Scaling: **
  – Abstraction: *
  – Provenance: **

(Figure: workflow graph of the paleoclimate reconstruction script, overlaid with a "?". Node labels: master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years, GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif.)

YesWorkflow.org

• YesWorkflow (YW)
  – Started as a grass-roots effort (Kurator, SKOPE, ..)
  – … meeting the scientists/users where they R!
    • R, Matlab, (i)Python, Jupyter, …
  – Scripts + simple user annotations
• => Reveal the workflow model/abstraction … that underlies the (script) implementation
• => YW can give us more of ASAP!
  – First YW: ASAP (Abstraction) ...
  – Then YW-recon: ASAP (reconstructing runtime Provenance)

YW (prospective) and YW-Recon (retrospective) Provenance

• 1. YW: Annotate Script => YW Model
  – Annotate @BEGIN .. @END, @IN, @OUT
  – Visualize, share, be happy :-)
• 2. Run script
  – Files are read and written
  – Folder- & Filenames have metadata
• 3. YW-Recon
  – Use @URI tags that link YW Model <=> Persisted Data
  – Run URI-template queries
    • cf. "ls -R" & RegEx matching
• 4. YW-Query
  – Answer the user's provenance queries
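Steps 3 and 4 can be pictured as turning a URI template into a regular expression and matching it against the files a run left behind. The sketch below in Python is one possible reading of "URI-template queries", not YW-Recon's actual code: the helper names template_to_regex and bindings are assumptions, while the template and the file paths follow the data-collection example used in this talk.

```python
import re


def template_to_regex(template):
    """Compile a template such as 'e{energy}/x.raw' into a regex
    with one named group per {variable}."""
    parts = re.split(r"\{(\w+)\}", template)  # literal, var, literal, var, ...
    pattern, seen = "", set()
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)          # literal text
        elif part in seen:
            pattern += f"(?P={part})"           # repeated variable must agree
        else:
            seen.add(part)
            pattern += f"(?P<{part}>[^/]+?)"    # one path segment or less
    return re.compile(pattern + r"\Z")


def bindings(template, paths):
    """Return the variable bindings of every path that fits the template."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in map(regex.match, paths) if m]


RAW_IMAGE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"

# A few of the files a run might leave behind (the result of an "ls -R").
PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/run_log.txt",
]

if __name__ == "__main__":
    rows = bindings(RAW_IMAGE, PATHS)
    print("samples:", sorted({row["sample_id"] for row in rows}))
    print("energies for DRT322:",
          sorted({row["energy"] for row in rows if row["sample_id"] == "DRT322"}))
```

The variable names in the template double as column names, so the matched files behave like a small table that provenance queries can filter.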

YW annotations: Model your Workflow!

YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!

• YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …

(Figure: YW workflow graph of a data-collection script.
Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity.
Data with URI templates:
  sample_spreadsheet   file:cassette_{cassette_id}_spreadsheet.csv
  calibration_image    file:calibration.img
  run_log              file:run/run_log.txt
  rejection_log        file:/run/rejected_samples.txt
  raw_image            file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
  corrected_image      file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
  collection_log       file:run/collected_images.csv
Other data: cassette_id, sample_score_cutoff, sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, sample_id, energy, frame_number, total_intensity, pixel_count, corrected_image_path.)

Paleoclimate Reconstruction (openSKOPE.org)

• … explained using YesWorkflow!

Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."

(Figure: YesWorkflow graph of the paleoclimate reconstruction script, from GetModernClimate and SubsetAllData through CAR_Analysis_unique, CAR_Analysis_union, and CAR_Reconstruction_union to the output files ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif and ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif.)

YW: Get 3 views for the price of 1

• Process-centric view - processes are the focus in workflows and in computational narratives
• Data-centric view - data are the focus of dataflow & provenance info and in data narratives
• Combined view (YW default) - dataflow-oriented workflow & provenance story

=> Towards Automating Data Narratives (Gil & Garijo)

(Figure: three renderings of the same workflow "main". Steps: fetch_mask, load_data, standardize_with_mask, simple_diagnose. Data: input_mask_file, input_data_file, land_water_mask, NEE_data, standardized_NEE_data, result_NEE_pdf.)
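The three views are projections of one model. As a rough sketch in Python (not how YesWorkflow renders its graphs), the combined view can be stored as steps with their inputs and outputs, and the other two views derived from it. The step and data names come from the slide; the wiring between them is an assumption read off the figure labels.

```python
# Combined view: each step with the data it reads and writes.
# Step and data names come from the slide; the wiring is an assumption.
STEPS = {
    "fetch_mask": (["input_mask_file"], ["land_water_mask"]),
    "load_data": (["input_data_file"], ["NEE_data"]),
    "standardize_with_mask": (["land_water_mask", "NEE_data"],
                              ["standardized_NEE_data"]),
    "simple_diagnose": (["standardized_NEE_data"], ["result_NEE_pdf"]),
}


def process_view(steps):
    """Edges (producer step, data, consumer step): data become edge labels."""
    edges = set()
    for producer, (_, outputs) in steps.items():
        for consumer, (inputs, _) in steps.items():
            for data in set(outputs) & set(inputs):
                edges.add((producer, data, consumer))
    return edges


def data_view(steps):
    """Edges (input data, step, output data): steps become edge labels."""
    return {
        (src, step, dst)
        for step, (inputs, outputs) in steps.items()
        for src in inputs
        for dst in outputs
    }


if __name__ == "__main__":
    for edge in sorted(process_view(STEPS)):
        print("process view:", " -> ".join(edge))
    for edge in sorted(data_view(STEPS)):
        print("data view:   ", " -> ".join(edge))
```

Keeping a single source model means the process-centric and data-centric pictures cannot drift apart.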

Multi-Scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP)

Christopher Schwalm, Yaxing Wei

(Figure: workflow graph of an MsTMIP analysis script.
Inputs: input_drough_variable, input_effect_variable.
Steps: fetch_drought_variable, fetch_effect_variable, convert_effect_variable_units, create_land_water_mask, init_data_variables, define_droughts, detrend_deseasonalize_effect_variable, calculate_data_variables, export_recovery_time_figure, export_drought_value_variable_figure, export_predrought_effect_variable_figure, export_drought_number_variable_figure.
Intermediate data: drought_variable_1, effect_variable_1, effect_variable_2, effect_variable_3, land_water_mask, sigma_dv_event, month_dv_length, and versions 1 and 2 of predrought_effect_variable, drought_value_variable, recovery_time_variable, drought_number_variable.
Outputs: output_recovery_time_figure, output_drought_value_variable_figure, output_predrought_effect_variable_figure, output_drought_number_figure.)

✔ Provenance capture (Matlab, R, Python, … scientific workflow systems)
✔ Uploading, sharing, linking provenance through various provenance tools
✗ Tools for scientists to exploit (≠ capture, share, link) provenance for their own day-to-day work.

=> Prime the provenance pump and increase provenance generation
=> Scientists accelerate their work via new, active uses of provenance.

But … how to prime the provenance pump?? Must support "Provenance for Self"!

(Diagram labels: Provenance for Self?!, Provenance for Others)

YW-RECON: Prospective & Retrospective Provenance … (almost) for free!

• URI-templates link conceptual entities to runtime provenance "left behind" by the script author …
• … facilitating provenance reconstruction

run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
│
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt

(Figure: the YW workflow graph of the data-collection script, whose data nodes carry URI templates such as file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw and file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img.)

Q1: What samples did the script run collect images from?

(Figure: the YW workflow graph with its URI templates, next to the run/ directory listing.)

Q2: What energies were used for image collection from sample DRT322?

(Figure: the YW workflow graph with its URI templates, next to the run/ directory listing.)

Q3: Where is the raw image of the corrected image DRT322_11000ev_030.img?

(Figure: the YW workflow graph with its URI templates, next to the run/ directory listing.)
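A question like Q3 runs against the direction of the dataflow: from an output file back to its input. One way to sketch this in Python, assuming the two file-name templates from the workflow graph, is to read the variables out of the corrected image's name and fill them into the raw-image template. The function name raw_images_for is hypothetical, and since the cassette id is not part of the corrected file name it is left as a wildcard.

```python
import fnmatch
import re

# Template of the raw images, as shown in the workflow graph.
RAW_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"

# File-name part of the corrected-image template
# {sample_id}_{energy}eV_{frame_number}.img, written out as a regex.
CORRECTED_NAME = re.compile(
    r"(?P<sample_id>[^_/]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img\Z"
)


def raw_images_for(corrected_path, known_paths):
    """Return the raw image paths that could have produced `corrected_path`."""
    match = CORRECTED_NAME.search(corrected_path)
    if match is None:
        return []
    # The cassette id is not encoded in the corrected name: use a wildcard.
    pattern = RAW_TEMPLATE.format(cassette_id="*", **match.groupdict())
    return fnmatch.filter(known_paths, pattern)


# A few of the files from the run/ directory listing.
PATHS = [
    "run/raw/q55/DRT240/e11000/image_030.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

if __name__ == "__main__":
    print(raw_images_for("data/DRT322/DRT322_11000eV_030.img", PATHS))
```

The matched raw path also carries the cassette id (q55 in this listing), which is the kind of information a question about the originating cassette needs.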

Q5: What cassette-id had the sample leading to DRT240_10000ev_001.img?

(Figure: the YW workflow graph with its URI templates, next to the run/ directory listing.)

Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher

Page 62: A Brief Provenance Tour  … via DataONE

[Figure: noWorkflow trace graph of one run of simulate_data_collection. Nodes are function activations and variables labeled with script line numbers, e.g. 50 spreadsheet_rows, 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 118 average_intensity, 121 writer.writerow. Files read: cassette_q55_spreadsheet.csv, calibration.img. Files written: run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, raw images such as run/raw/q55/DRT240/e10000/image_001.raw, and corrected images such as run/data/DRT240/DRT240_10000eV_001.img, for samples DRT240 and DRT322.]

noWorkflow: not only Workflow!

• Scripts have provenance, too!
• Transparently capture some/all provenance from Python script runs.
• Use filter queries to “zoom” into relevant parts.
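To make the idea of transparent capture concrete, here is a toy sketch in Python. It is not noWorkflow's implementation or API; it only uses the standard-library hook `sys.setprofile` to record which functions a script calls, and then applies a simple filter to "zoom" into one function. The helper `capture_calls` and the three stand-in functions are hypothetical names chosen for illustration.

```python
import sys


def capture_calls(func, *args, **kwargs):
    """Run func and record (caller line number, function name) for each
    Python-level function call made while it runs."""
    events = []

    def profiler(frame, event, arg):
        # 'call' fires when a Python function is entered; C calls are ignored.
        if event == "call":
            caller = frame.f_back
            line = caller.f_lineno if caller is not None else None
            events.append((line, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always switch the hook off again
    return result, events


# Hypothetical stand-ins for steps of a data-collection script.
def collect_next_image(frame_number):
    return "image_{:03d}.raw".format(frame_number)


def transform_image(raw_image_file):
    return raw_image_file.replace(".raw", ".img")


def simulate_data_collection():
    corrected = []
    for n in (1, 2):
        corrected.append(transform_image(collect_next_image(n)))
    return corrected


result, events = capture_calls(simulate_data_collection)

# Filter query: keep only the activations of one function.
transform_calls = [e for e in events if e[1] == "transform_image"]
```

The script itself is left untouched; only the way it is launched changes, which is the sense of "transparent" illustrated here.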


Page 63: A Brief Provenance Tour  … via DataONE

[Figure, four panels connected by "lineage query" arrows:

(1) YesWorkflow model of simulate_data_collection. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: cassette_id, sample_score_cutoff, data_redundancy, sample_spreadsheet, calibration_image, run_log, sample_name, sample_quality, accepted_sample, rejected_sample, num_images, energies, rejection_log, sample_id, energy, frame_number, raw_image, corrected_image, total_intensity, pixel_count, collection_log.

(2) YesWorkflow upstream subgraph ending in corrected_image. Steps: load_screening_results, calculate_strategy, collect_data_set, transform_images. Data: cassette_id, sample_score_cutoff, data_redundancy, sample_spreadsheet, calibration_image, sample_name, sample_quality, accepted_sample, num_images, energies, sample_id, energy, frame_number, raw_image.

(3) noWorkflow trace graph of the full run (same graph as on the previous slide).

(4) noWorkflow lineage of run/data/DRT240/DRT240_11000eV_002.img with recorded values: cassette_id = 'q55', sample_score_cutoff = 12.0, data_redundancy = 0.0, calibration_image_file = 'calibration.img', sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv', sample_name = 'DRT240', sample_quality = 45, calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000]), accepted_sample = 'DRT240', num_images = 2, energies = [10000, 11000, 12000], sample_id = 'DRT240', energy = 11000, frame_number = 2, raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw', transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img').]

YesWorkflow: Conceptual workflow model

noWorkflow: Python trace model

But how do we bridge this gap???

Would like to use YW model to query NW data!


Page 64: A Brief Provenance Tour  … via DataONE

Habemus Pons! We’ve got the Bridge! The bridge is the journey .. (The journey is the destination)

Lineage of image file in terms of YW model, with details from NW provenance
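One way to picture the bridge is as a join between data names in the YW model and variables recorded by NW. The sketch below is an illustration under assumptions, not the YW-NW bridge itself: the recorded values are the ones shown on the previous slide for DRT240_11000eV_002.img, while the `YW_TO_NW` name mapping and the function `annotate_lineage` are hypothetical.

```python
# Upstream data names of corrected_image in the YW model (from the slide).
YW_LINEAGE = [
    "cassette_id", "sample_score_cutoff", "data_redundancy",
    "sample_spreadsheet", "calibration_image",
    "sample_name", "sample_quality", "accepted_sample", "num_images",
    "energies", "sample_id", "energy", "frame_number", "raw_image",
]

# Values recorded by NW for one run (copied from the slide).
NW_VALUES = {
    "cassette_id": "q55",
    "sample_score_cutoff": 12.0,
    "data_redundancy": 0.0,
    "calibration_image_file": "calibration.img",
    "sample_spreadsheet_file": "cassette_q55_spreadsheet.csv",
    "sample_name": "DRT240",
    "sample_quality": 45,
    "accepted_sample": "DRT240",
    "num_images": 2,
    "energies": [10000, 11000, 12000],
    "sample_id": "DRT240",
    "energy": 11000,
    "frame_number": 2,
    "raw_image_file": "run/raw/q55/DRT240/e11000/image_002.raw",
}

# Assumed mapping for names that differ between the two models.
YW_TO_NW = {
    "sample_spreadsheet": "sample_spreadsheet_file",
    "calibration_image": "calibration_image_file",
    "raw_image": "raw_image_file",
}


def annotate_lineage(yw_names, nw_values, name_map):
    """Pair each YW data name with the value NW recorded for it."""
    annotated = {}
    for yw_name in yw_names:
        nw_name = name_map.get(yw_name, yw_name)
        annotated[yw_name] = nw_values.get(nw_name)  # None if not recorded
    return annotated


lineage = annotate_lineage(YW_LINEAGE, NW_VALUES, YW_TO_NW)
```

With such a join, a question phrased in model terms (for example, which cassette_id led to a given image) can be answered from trace data via `lineage["cassette_id"]`.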


Page 65: A Brief Provenance Tour  … via DataONE

Demo Time


Page 66: A Brief Provenance Tour  … via DataONE

YW-IDCC’17 Demo Use Cases

Domain | Use case | Programming language | Provenance methods
Climate science | C3C4 | MATLAB | YW + MATLAB RunManager
Astrophysics | LIGO | Python | YW + NW (code-level)
Protein crystal samples | Simulate data collection | Python | YW + NW (code-level)
Biodiversity data curation | kurator-SPNHC | Python | YW-recon + YW-logging
Social network analysis | Twitter | Python | YW + NW (file-level)
Oceanography | OHIBC Howe Sound (multi-run, multi-script) | R | YW + R RunManager


Page 67: A Brief Provenance Tour  … via DataONE

LIGO example: What strain_L1_whitenbp depends on …

Overall workflow

Upstream of strain_L1_whitenbp (prospective)

[Figure, four panels:

(1) Overall YesWorkflow graph GRAVITATIONAL_WAVE_DETECTION. Steps: LOAD_DATA (load hdf5 data), AMPLITUDE_SPECTRAL_DENSITY, WHITENING (suppress low-frequency noise), BANDPASSING (remove high-frequency noise), STRAIN_WAVEFORM_FOR_WHITENED_DATA, SPECTROGRAMS_FOR_STRAIN_DATA, SPECTROGRAMS_FOR_WHITEND_DATA, FILTER_COEFS, FILTER_DATA, STRAIN_WAVEFORM_FOR_FILTERED_DATA, WAVE_FILE_GENERATOR_FOR_WHITENED_DATA, SHIFT_FREQUENCY_BANDPASSED, WAVE_FILE_GENERATOR_FOR_SHIFTED_DATA, DOWNSAMPLING (from 16384 Hz to 4096 Hz). Inputs: FN_Detector (file:{Detector}_LOSC_4_V1-1126259446-32.hdf5), FN_Sampling_rate (file:H-H1_LOSC_{DownSampling}_V1-1126259446-32.hdf5), fs.

(2) upstream(strain_L1_whitenbp) [prospective]: LOAD_DATA (strain_H1, strain_L1), AMPLITUDE_SPECTRAL_DENSITY (PSD_H1, PSD_L1), WHITENING (strain_H1_whiten, strain_L1_whiten), BANDPASSING (strain_L1_whitenbp); inputs FN_Detector, FN_Sampling_rate, fs.

(3) upstream(strain_L1_whitenbp) [URI-recon]: same graph, with FN_Detector resolved to L-L1_LOSC_4_V1-1126259446-32.hdf5 and H-H1_LOSC_4_V1-1126259446-32.hdf5, and FN_Sampling_rate resolved to H-H1_LOSC_4_V1-1126259446-32.hdf5 and H-H1_LOSC_16_V1-1126259446-32.hdf5.

(4) upstream(strain_L1_whitenbp) [NW-recon]: fn_d = L-L1_LOSC_4_V1-1126259446-32.hdf5, fs = 4096, strain_L1 = array([-1.779e-18, -1.765e-18, ..., -1.719e-18]), psd_L1 = scipy.interpolate.interpolate.interp1d object at 0x113969418, strain_L1_whiten = array([8.494, -1.672, ..., 72.156]), strain_L1_whitenbp = array([8.184, 19.935, ..., -0.684]).]

Upstream of strain_L1_whitenbp (hybrid YW-NW at the code level)

Upstream of strain_L1_whitenbp (hybrid YW-NW at the file level)

• 3 inputs spread across 5 (= 2x2 + 1) files
• Does intermediate data strain_L1_whitenbp depend on all 5 inputs?
• Intermediate data strain_L1_whitenbp depends on only 2 out of 5 inputs!
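The "2 out of 5" observation amounts to an upstream (ancestor) query over a dependency graph. The following is a minimal sketch, assuming a hand-written edge list that loosely follows the L1 and H1 branches named on the slide; the exact wiring of the real LIGO script is not recoverable from the figure, so `DEPENDS_ON` and `upstream` are illustrative only.

```python
L1_FILE = "L-L1_LOSC_4_V1-1126259446-32.hdf5"
H1_FILE = "H-H1_LOSC_4_V1-1126259446-32.hdf5"

# Assumed dependencies: each item maps to the items it was derived from.
DEPENDS_ON = {
    "strain_L1": [L1_FILE, "fs"],
    "strain_H1": [H1_FILE, "fs"],
    "PSD_L1": ["strain_L1"],
    "PSD_H1": ["strain_H1"],
    "strain_L1_whiten": ["strain_L1", "PSD_L1"],
    "strain_H1_whiten": ["strain_H1", "PSD_H1"],
    "strain_L1_whitenbp": ["strain_L1_whiten"],
    "strain_H1_whitenbp": ["strain_H1_whiten"],
}


def upstream(graph, item):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


ancestors = upstream(DEPENDS_ON, "strain_L1_whitenbp")

# Inputs are the ancestors that are not themselves derived from anything.
inputs_used = sorted(a for a in ancestors if a not in DEPENDS_ON)
```

Under these assumed edges, the query reaches only the L1 file and fs, and never touches the H1 branch, which is the kind of pruning the retrospective (NW-recon) view makes visible.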


Page 68: A Brief Provenance Tour  … via DataONE

Finer-grained Provenance: User Log Files!


Page 69: A Brief Provenance Tour  … via DataONE

Conclusions 1: DataONE Provenance at a Glance

§ ProvONE extensions to W3C PROV
§ New first-class objects in DataONE
  § Figures, Software
§ Extended Data Package with Provenance
§ New Tools
  § Provenance indexing
  § Web search and browse interface for provenance
  § Matlab tool for generating provenance
  § R tool for generating provenance
  § YesWorkflow tool for modeling provenance from scripts


Page 70: A Brief Provenance Tour  … via DataONE

Conclusions 2: Towards Provenance-for-Self

• Provenance …
  – … is key to transparency, reproducibility, comprehensibility
  – … comes in many (hybrid) forms (workflow graphs, log files, trace events, ...)
  – … is metadata (=> “a love note to the future”)
  – … should be actionable today (feed both, your IGM & RDM)
• Provenance-for-Self …
  – … asks: how does provenance help me get my work done today?
  – … is what provenance technologists and tool builders should do more of!

Inside the mind of a master procrastinator (TED Talk by Tim Urban)

Page 71: A Brief Provenance Tour  … via DataONE

Conclusions 3 (other projects)

• SKOPE: system and tools to discover, access, analyze, visualize paleoenvironmental data
  – unprecedented ability to explore provenance (detailed, comprehensible record of computational derivation of results)
  – for researchers, tinkerers, and modelers
  – → SKOPE Poster by Andrew Brown
• Whole Tale:
  – leverage & contribute to existing CI to support the whole tale (“living paper”), from workflow run to scholarly publication
  – integrate tools & CI (DataONE, EC!?, Globus, iRODS, NDS, ...) to simplify use and promote best practices.
  – driven by science WGs (archaeology/SKOPE, materials science, astro, bio ..)


Page 72: A Brief Provenance Tour  … via DataONE

Time allowing …

… thoughts on reproducibility and provenance ...


Page 73: A Brief Provenance Tour  … via DataONE


Page 74: A Brief Provenance Tour  … via DataONE

PRIMAD (what have you “primed”?)

Dagstuhl Seminar #16041 Report

Outputs = Exec(M, I, P, D) | RO, A
- M = parsimony/bootstrap/..
- I = package XYZ
- P = MacOS ..
- D = (Params, Files)
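As a rough illustration of the formula above, one can record the PRIMAD components of two runs and list which ones were changed ("primed"). This is a sketch under assumptions: the `Experiment` dataclass, the `primed` helper, and all example values other than parsimony, bootstrap, package XYZ and MacOS are hypothetical; only the component letters M, I, P, D, RO, A come from the slide.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Experiment:
    method: str              # M, e.g. parsimony or bootstrap
    implementation: str      # I, e.g. a software package
    platform: str            # P, e.g. the operating system
    data: tuple              # D = (params, files)
    research_objective: str  # RO
    actor: str               # A


def primed(original, rerun):
    """Names of the components that differ between two experiments."""
    return [f.name for f in fields(Experiment)
            if getattr(original, f.name) != getattr(rerun, f.name)]


# Made-up example values, for illustration only.
original = Experiment("parsimony", "package XYZ", "MacOS",
                      (("k=3",), ("input.csv",)), "tree inference", "alice")
rerun = Experiment("bootstrap", "package XYZ", "Linux",
                   (("k=3",), ("input.csv",)), "tree inference", "alice")

changed = primed(original, rerun)
```

Listing the changed components makes explicit what a rerun actually varied (here the method and the platform), and therefore what a matching or differing output can tell us.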

Page 75: A Brief Provenance Tour  … via DataONE

PRIMAD (what have you “primed”?)

Dagstuhl Seminar #16041 Report

Page 76: A Brief Provenance Tour  … via DataONE

Reproducibility Crisis (reprised)

• Successful reproducibility study:
  • increases trust in prior study ☺
  • … but no surprises ☹
• Failed reproducibility study:
  • decreases trust in (or falsifies) prior study ☹
  • … but surprising failure yields new info/knowledge ☺
• Learning from failures!
  – Not really a new, revolutionary idea ..
  – What is a positive vs negative result anyways?
  – ... fail early, fail often ...
