Scientific Data Management - From the Lab to the Web Semantic Data Management Dagstuhl Seminar 22-27 April 2012 José Manuel Gómez Pérez, iSOCO www.wf4ever-project.org


Page 1

Scientific Data Management -

From the Lab to the Web

Semantic Data Management, Dagstuhl Seminar, 22-27 April 2012

José Manuel Gómez Pérez, iSOCO

www.wf4ever-project.org

Page 2

Some facts

The data deluge

Source: IDC, The 2011 Digital Universe Study: Extracting Value from Chaos

» In 2010 the size of the digital universe exceeded 1 zettabyte (1 ZB = 1 trillion GB)

» 1.8 ZB in 2011

» 35 ZB expected in 2020

» 90% unstructured data

» 70% user-generated

» 75% resulting from data copying, merging, and transforming

» Metadata is the fastest growing data category

» Much of such data is dynamic, real-time, volatile
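As a rough check on the figures above, going from 1.8 ZB in 2011 to 35 ZB expected in 2020 implies compound growth of roughly 39% per year. The sketch below is only arithmetic on the numbers quoted on this slide. It assumes decimal (SI) units and a constant year-on-year rate, neither of which is stated in the source.

```python
# Assumptions (not from the slides): SI units (1 ZB = 10**21 bytes,
# 1 GB = 10**9 bytes) and a constant growth rate between 2011 and 2020.
ZB_IN_BYTES = 10**21
GB_IN_BYTES = 10**9


def implied_annual_growth(start_zb, end_zb, years):
    """Compound annual growth rate that turns start_zb into end_zb."""
    return (end_zb / start_zb) ** (1 / years) - 1


gb_per_zb = ZB_IN_BYTES // GB_IN_BYTES  # 10**12, i.e. one trillion GB
growth = implied_annual_growth(1.8, 35.0, 2020 - 2011)

print(f"1 ZB = {gb_per_zb:,} GB")
print(f"Implied growth 2011-2020: {growth:.1%} per year")
```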

Page 3

Two main challenges

Dealing with dynamicity

» Challenge 1: Identifying and structuring the relevant portions of the data for the task at hand

› First-class data citizens

» Challenge 2: Managing the lifecycle of data entities

› Preservation

› Evolution and versioning

› Decay

Both technical and social aspects involved

Page 4

Workflows in the Scientific Method

The Research Lifecycle

Example: Genome-Wide Association Studies

[Figure: research lifecycle diagram. Labels: Background, Hypothesis, Assumptions, Input data, Method, Experiment, Results (data), Scientific Interpretation, Publication]

Page 5

Workflow-based Science

What is a Scientific Workflow?

» A mechanism for coordinating the execution of services and linking together resources.

» The combination of data and processes into a configurable, structured set of steps that implement semi-automated computational solutions in scientific problem-solving.

Scientific workflows are at the core of scientific data management

› Enable automation

› Encourage best practices
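The idea of a workflow as a configurable, structured set of steps that combines data and processes can be sketched in a few lines. This is an illustration only: the step names and the `run_workflow` helper are hypothetical and do not correspond to any particular workflow system.

```python
from typing import Any, Callable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], data: Any) -> Tuple[Any, List[str]]:
    """Run the steps in order, feeding each output to the next step.

    Returns the final result plus a simple execution trace, a stand-in
    for the provenance a real workflow system would record.
    """
    trace = []
    for name, func in steps:
        data = func(data)
        trace.append(f"{name} -> {data!r}")
    return data, trace


# Hypothetical three-step analysis: clean, transform, summarise.
steps = [
    ("drop_missing", lambda xs: [x for x in xs if x is not None]),
    ("square", lambda xs: [x * x for x in xs]),
    ("total", sum),
]

result, trace = run_workflow(steps, [1, None, 2, 3])
print(result)
for line in trace:
    print(line)
```

Because the steps are plain data, the same runner can be reconfigured by swapping or reordering entries in the list, which is the automation benefit the slide refers to.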

Page 6

Challenge 1

Identifying and structuring the relevant portions of the data for the task at hand

First-class data citizens

Page 7

Questions for Scientific Data and Workflows

» Who are you? Where and when were you born? Who were your parents (creators)?

› Issues: Identity and Description, Authenticity, Uniqueness

» For which purpose were you conceived, and how have you been used?

› Issues: Reuse, Repurpose

» What do you have inside?

› Issues: Inspection, Visualization, Annotations

» How is your content linked?

› Issues: Graphical Representation

» May I access all your parts?

› Issues: Access Rights

» Which parts can I replace?

› Issues: Adaptability

» What have they done to you? Who and when? Why did they do that?

› Issues: Provenance, Versioning

» Why have you been recommended to me? Can I believe what you are saying or trust your results?

› Issues: Information Quality

» Do you still produce the same results?

› Issues: Reproducibility

» Are you still working? How could I repair you?

› Issues: Completeness, Stability

» How could I thank you? How could I talk about you?

› Issues: Credit

Page 8

Research Objects as Technical Objects

Challenge 1: Identifying and structuring the relevant data

Carriers of Research Context

» Referentiable

» Aggregation, Dispersed

› Heterogeneous

› Local and External

» Annotated metadata

› Provenance

› Structured: Manifests, Recipes, Permissions, Discourse

» Lifecycle

› Publishing, Evolution

› Versioning

» Mixed Stewardship

› Graceful Degradation

» Sharing

» Security & Privacy

» Stereotypical User Profiles

» Services

[Figure labels: Distributed Third Party Tenancy, Alien Store, Technical Objects, Social Objects, OAI-ORE]

Page 9

Research Objects as Social Objects

Package, Explore, Inspect, Review, Exchange, Share, Reuse, Publish, Credit

Page 10

Research Object model core (simplified)

http://purl.org/wf4ever/ro#

[Figure: simplified RO core. Classes: ro:ResearchObject, ro:Resource, ro:Manifest, ro:AggregatedAnnotation, wfdesc:Workflow. Properties: ore:aggregates, ore:isDescribedBy, ro:annotatesAggregatedResource]

Note: This figure shows a simplified view of the RO core.

RO specification: http://wf4ever.github.com/ro

› ro (aggregation and annotation)

› wfdesc (workflow description)

› Minim* (minimum info model)

› wfprov (workflow provenance)

› roprov (RO provenance)

› roevo (evolution model)

*Minim based on M. Gamble’s MIM
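The core terms named on this slide can be illustrated by writing a tiny research object as plain subject-predicate-object triples. This is a sketch under stated assumptions, not the project's tooling. The `ro` namespace comes from the slide and `ore` is the OAI-ORE terms namespace. The `wfdesc` namespace URI is assumed by analogy with `ro`, and every `example.org` URI is hypothetical. How the original figure connects `wfdesc:Workflow` cannot be recovered from the text, so the sketch simply treats the workflow as an aggregated resource.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RO = "http://purl.org/wf4ever/ro#"                # from the slide
ORE = "http://www.openarchives.org/ore/terms/"    # OAI-ORE vocabulary
WFDESC = "http://purl.org/wf4ever/wfdesc#"        # assumed namespace URI

# Hypothetical resource URIs, for illustration only.
BASE = "http://example.org/ro/gwas-study/"
ro_uri = BASE
manifest = BASE + ".ro/manifest.rdf"
workflow = BASE + "workflow.t2flow"
annotation = BASE + ".ro/annotations/1"

triples = [
    (ro_uri, RDF_TYPE, RO + "ResearchObject"),
    (ro_uri, ORE + "isDescribedBy", manifest),
    (manifest, RDF_TYPE, RO + "Manifest"),
    (ro_uri, ORE + "aggregates", workflow),
    (workflow, RDF_TYPE, RO + "Resource"),
    (workflow, RDF_TYPE, WFDESC + "Workflow"),
    (ro_uri, ORE + "aggregates", annotation),
    (annotation, RDF_TYPE, RO + "AggregatedAnnotation"),
    (annotation, RO + "annotatesAggregatedResource", workflow),
]


def to_ntriples(rows):
    """Serialise (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in rows)


def aggregated(rows, research_object):
    """Resources that the given research object aggregates."""
    return [o for s, p, o in rows
            if s == research_object and p == ORE + "aggregates"]


print(to_ntriples(triples))
print(aggregated(triples, ro_uri))
```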

Page 11

Challenge 2

Managing the lifecycle of data entities

Evolution and Decay

Page 12

RO Evolution & Versioning

Challenge 2: Managing the lifecycle of data entities

Page 13

RO Decay

Challenge 2: Managing the lifecycle of data entities

Workflow Decay

• Component level

• flux/decay/unavailability

• Data level

• Infrastructure level

Experiment Decay

• Methodological changes

• New technologies

• New resources/components

• New data

Page 14

Preservation, Conservation, Recreating

Preserving

• Archived Record

• Fixed Snapshots

• Review

• Rerun & Replay

Conserving

• Active Instrument

• Live

• Rerun & Reuse

• Repair & Restore

Recreating

• Archived Record

• Active Instrument

• Live

• Rebuild, Recycle, Repurpose

Page 15

Possible types of decay (an example)

Challenge 2: Managing the lifecycle of data entities

Page 16

A Taxonomy of RO decay

Decay Analysis

1. Service tool is missing

2. Service file descriptor disappeared

3. Service up but not contactable

4. Service up but functionality changed

5. Local software dependencies

6. Data unavailability

7. Changes in data formats

8. Chained dependency

9. Credentials deprecated

10. Input data superseded by other data

11. RO metadata outdated (upon versioning)

12. Old fashioned RO

13. External references lose credit

14. Execution framework no longer available
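The fourteen decay types listed above can be captured as an enumeration so that monitoring code can refer to them by name. The enum below only mirrors the slide's list. The `classify_service_probe` helper is a hypothetical triage rule for the three service-related types (2 to 4) and is not taken from the source.

```python
from enum import Enum
from typing import Optional


class DecayType(Enum):
    """Decay types, numbered as on the slide."""
    SERVICE_TOOL_MISSING = 1
    SERVICE_FILE_DESCRIPTOR_DISAPPEARED = 2
    SERVICE_UP_NOT_CONTACTABLE = 3
    SERVICE_UP_FUNCTIONALITY_CHANGED = 4
    LOCAL_SOFTWARE_DEPENDENCIES = 5
    DATA_UNAVAILABILITY = 6
    CHANGES_IN_DATA_FORMATS = 7
    CHAINED_DEPENDENCY = 8
    CREDENTIALS_DEPRECATED = 9
    INPUT_DATA_SUPERSEDED = 10
    RO_METADATA_OUTDATED = 11
    OLD_FASHIONED_RO = 12
    EXTERNAL_REFERENCES_LOSE_CREDIT = 13
    EXECUTION_FRAMEWORK_UNAVAILABLE = 14


def classify_service_probe(descriptor_found: bool,
                           reachable: bool,
                           output_as_expected: bool) -> Optional[DecayType]:
    """Hypothetical triage of one service check; None means no decay seen."""
    if not descriptor_found:
        return DecayType.SERVICE_FILE_DESCRIPTOR_DISAPPEARED
    if not reachable:
        return DecayType.SERVICE_UP_NOT_CONTACTABLE
    if not output_as_expected:
        return DecayType.SERVICE_UP_FUNCTIONALITY_CHANGED
    return None


print(classify_service_probe(True, False, True))
print(len(DecayType))
```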

Page 17

Sample decay type

A taxonomy of workflow decay

Page 18

1.0 Certificate – Evaluation of Stability and Completeness

Decay Analysis

Stability: Is the RO free from any form of decay preventing workflow execution?

» Focus on reproducibility

» Assisted detection of RO decay

» Active monitoring of decay forms

» RO and workflow provenance

Completeness: Is the minimal aggregation of resources encapsulated by the RO consistent?

» RO checklists

» Produced by scientists

» Automatically checked against minimal model (minim)

» RO evolution

1.0 Certificate notion originally proposed by Yde de Jong

1.0 Certificate of quality

» Notification

» Explanation
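The two questions behind the certificate can be sketched as a simple check: completeness compares the RO's aggregated resources against a checklist, and stability asks whether any decay has been reported. This is a simplified stand-in, not the Minim model itself. The `Assessment` class, the `assess` function and the sample resource names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Assessment:
    complete: bool
    stable: bool
    explanation: List[str] = field(default_factory=list)

    @property
    def certified(self) -> bool:
        """Certificate is granted only if both checks pass."""
        return self.complete and self.stable


def assess(aggregated: Set[str], checklist: Set[str],
           decay_reports: List[str]) -> Assessment:
    """Check checklist coverage (completeness) and reported decay (stability)."""
    missing = sorted(checklist - aggregated)
    explanation = [f"missing required resource: {m}" for m in missing]
    explanation += [f"decay detected: {d}" for d in decay_reports]
    return Assessment(complete=not missing,
                      stable=not decay_reports,
                      explanation=explanation)


# Hypothetical RO contents, checklist and monitoring output.
aggregated = {"workflow", "input data", "hypothesis"}
checklist = {"workflow", "input data", "results"}
report = assess(aggregated, checklist, ["Service up but not contactable"])

print(report.certified)
for line in report.explanation:
    print(line)
```

The explanation list corresponds to the "Explanation" item on the slide: a failed certificate reports which checklist items or decay forms caused the failure.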

Page 19

Lessons learnt

Recap

» Data with a Purpose

» Encapsulate & Conquer

› Goal-driven (purpose)

› Aggregation

› Community-managed

» Nothing is immutable, especially data.

› Foster evolution

› Monitor decay

Scalability

Provenance

Page 20

Questions

Thanks for your Attention!

Any Questions?

http://www.wf4ever-project.org/