Data as code: Data management for reproducible research

Martin O’Reilly
Principal Research Software Engineer
The Alan Turing Institute
08/09/2017

@martinoreilly | @turinginst

Research Software Engineers Association



The Alan Turing Institute

The Alan Turing Institute is the national centre for data science, headquartered at the British Library.


Turing Research Engineering

• Radka Jersakova

• May Yong

• Tim Hobson

• James Geddes

• James Hetherington

Turing Research Fellows

• Kirstie Whitaker

• Tomas Petricek


Data management for reproducible research


FAIR Data Principles


Source: FORCE11 website. https://www.force11.org/group/fairgroup/fairprinciples. Accessed on 07 Sep 2017

• Findable

• Accessible

• Interoperable

• Re-usable


Code management for reproducible research

• How do I get your code?

  • Online repositories and persistent archives with versioning support

• How do I use your code?

  • Documentation, examples, packages, virtual machines, containers

• How do I trust your code?

  • Tests, examples, readable code

• How do I build on your code?

  • Documentation, readable code, tests

• What am I allowed to do with your code?

  • Licence


Data management for reproducible research

• How do I get your data?

  • Online repositories with versioning and APIs for data access

• How do I use your data?

  • Documentation, metadata, common data formats, data packages

• How do I trust your data?

  • Record of provenance and processing, versioning

• How do I build on your data?

  • Record of provenance and processing, compatible content, linkable to other data

• What am I allowed to do with your data?

  • Licences, terms of use, data access agreements, ethics
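
Several of the answers above (metadata, versioning, provenance, licence) can travel with the data as a small machine-readable descriptor. The following is a minimal sketch, loosely modelled on the Frictionless Data "Data Package" layout. The helper `build_descriptor`, the dataset name, file path, version, licence and source title are all hypothetical placeholders chosen for illustration.

```python
import hashlib
import json


def sha256_of(content: bytes) -> str:
    """Return a hex SHA-256 digest so readers can verify what they downloaded."""
    return hashlib.sha256(content).hexdigest()


def build_descriptor(name, version, licence, source_title, path, content):
    """Assemble a descriptor loosely following the Data Package layout.

    All values are supplied by the caller; nothing is looked up from a
    real repository.
    """
    return {
        "name": name,
        "version": version,                     # which release is this?
        "licenses": [{"name": licence}],        # what am I allowed to do?
        "sources": [{"title": source_title}],   # where did it come from?
        "resources": [
            {
                "name": name,
                "path": path,
                "format": "csv",
                "bytes": len(content),
                "hash": "sha256:" + sha256_of(content),
            }
        ],
    }


# Hypothetical example data and names, for illustration only.
example_csv = b"year,value\n2015,1.0\n2016,2.0\n"
descriptor = build_descriptor(
    name="example-dataset",
    version="1.0.0",
    licence="CC-BY-4.0",
    source_title="Hypothetical survey export",
    path="data/example.csv",
    content=example_csv,
)
print(json.dumps(descriptor, indent=2))
```

Recording the byte count and checksum next to the licence and source means a reader can check both what they are allowed to do and whether they hold the same bytes as the author.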


Good examples


UN Comtrade database


Web API for programmatic access


Can apply current and historical classification codes to entire dataset

Can select subset of data to retrieve along multiple dimensions

Source: Screenshot of UN Comtrade database website. https://comtrade.un.org/data. Accessed on 06 Sep 2017
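
Selecting a subset along several dimensions through a web API usually comes down to encoding each dimension as a query parameter. The sketch below shows that pattern only. The endpoint URL, the parameter names (`reporter`, `partner`, `period`, `classification`), the placeholder codes and the comma-joining convention are assumptions for illustration, not the documented UN Comtrade interface; consult the API documentation for the real parameters.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint, used only to illustrate the pattern.
BASE_URL = "https://example.org/api/get"


def build_query_url(base_url, **dimensions):
    """Encode one query parameter per dimension.

    A list or tuple value selects several members of a dimension and is
    joined with commas (an assumed convention).
    """
    params = {}
    for key, value in sorted(dimensions.items()):
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        params[key] = str(value)
    return base_url + "?" + urlencode(params)


url = build_query_url(
    BASE_URL,
    reporter="826",           # placeholder code
    partner=["0", "842"],     # placeholder codes
    period=[2015, 2016],
    classification="HS",
)
print(url)
# Round-trip the query string to check what a server would receive.
print(parse_qs(urlsplit(url).query))
```

Keeping the query in code rather than in a sequence of clicks on a web form means the exact subset used in an analysis can be versioned and re-run.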


UN Comtrade database


Third-party R package available for querying web API


Source: Screenshot from comtradr R package GitHub README.md. https://github.com/ChrisMuir/comtradr. Accessed on 06 Sep 2017


ConnectomeDB


Source: Screenshot of ConnectomeDB login page. https://db.humanconnectome.org. Accessed on 06 Sep 2017

Website requires registration and login


ConnectomeDB


One-time click for acceptance of terms

Generate dedicated Amazon AWS access credentials

Source: Screenshot of ConnectomeDB main page. https://db.humanconnectome.org. Accessed on 06 Sep 2017


The Gamma

Dot-driven development

• IntelliSense autocomplete for data exploration

• Interactive dynamic data preview

• Uses F# type providers

• For more details, see http://tomasp.net/academic/papers/pivot/


Source: The Gamma homepage. https://thegamma.net/. Accessed on 06 Sep 2017
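
The idea of exploring data by typing a dot and choosing from the completions can be approximated in other languages. The sketch below is a loose Python analogue, not The Gamma's implementation and not an F# type provider: a hypothetical `Explorer` class exposes the keys of a nested mapping as attributes and reports them through `__dir__`, which interactive shells commonly use for completion. The data values are made up.

```python
class Explorer:
    """Expose the members of a nested mapping as attributes."""

    def __init__(self, data):
        self._data = data

    def __dir__(self):
        # Advertise only keys that are valid Python identifiers.
        return [key for key in self._data if key.isidentifier()]

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested mappings so exploration can continue after the next dot.
        return Explorer(value) if isinstance(value, dict) else value


# Made-up figures, for illustration only.
spending = Explorer({
    "by_service": {"health": 10, "defence": 5},
    "by_year": {"y2015": 15},
})
print(dir(spending))               # ['by_service', 'by_year']
print(spending.by_service.health)  # 10
```

Unlike a type provider, this check happens at run time, so a misspelt member raises `AttributeError` only when the line executes.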


The Gamma


Source: UK National Statistics Public Expenditure Statistical Analyses 2016. Chapter 5 table 5.2. https://www.gov.uk/government/statistics/public-expenditure-statistical-analyses-2016/. Accessed on 06 Sep 2017

Subtotals indicated by background colour

Sub-subcategories indicated by text formatting

Subcategories indicated by initial numerals
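
Structure that is carried only by presentation has to be made explicit before code can use it. Of the three cues above, only the initial numerals survive in plain text. The sketch below turns them into explicit code and parent-code values. It assumes labels of the form "1. Name" and "1.1 Name"; the labels used here are hypothetical and are not taken from the published table.

```python
import re

# Matches a leading code such as "1." or "1.2" followed by the label text.
CODE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$")


def split_code(label):
    """Return (code, parent_code, text) for a row label.

    Labels without a leading numeral get a code and parent_code of None.
    """
    match = CODE_PATTERN.match(label)
    if match is None:
        return None, None, label.strip()
    code, text = match.group(1), match.group(2).strip()
    # "1.2" has parent "1"; a top-level code such as "1" has no parent.
    parent = code.rsplit(".", 1)[0] if "." in code else None
    return code, parent, text


# Hypothetical row labels, for illustration only.
for label in ["1. Category A", "1.1 Subcategory A1", "Total"]:
    print(split_code(label))
```

A label that begins with a bare number followed by text, such as a year, would be misread as a code, so real tables need a check against the known list of categories.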


The Gamma


Source: Gamma @ The Turing: Accounting for Democracy. http://gamma.turing.ac.uk/expenditure/. Accessed on 06 Sep 2017


The Gamma


Source: Gamma @ The Turing: Accounting for Democracy. http://gamma.turing.ac.uk/expenditure/. Accessed on 06 Sep 2017


Dream data


My wish list

• Repository supporting versioning and content-aware sub-setting

• Data includes raw and processed data, with code to replicate processing

• Content-aware, on-demand differential download

• Automatable access to data requiring an access agreement / authentication

• Data accessible as native code objects

• Documentation accessible in context of data presentation

• Standard, machine-readable licences

• Repository tracks download / usage stats
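
The differential download item in the wish list can be approximated at file level with checksum manifests. This is a minimal sketch under stated assumptions: the repository is assumed to publish a manifest mapping each path to a SHA-256 digest, and `build_manifest` and `plan_download` are hypothetical helpers. It compares whole files, so it is coarser than the content-aware behaviour the list asks for.

```python
import hashlib


def build_manifest(files):
    """Map each path to the SHA-256 digest of its content (bytes)."""
    return {
        path: hashlib.sha256(content).hexdigest()
        for path, content in files.items()
    }


def plan_download(local_manifest, remote_manifest):
    """Work out the smallest set of transfers needed to match the remote."""
    # Fetch anything that is new or whose digest differs from the local copy.
    to_fetch = sorted(
        path
        for path, digest in remote_manifest.items()
        if local_manifest.get(path) != digest
    )
    # Anything held locally but absent remotely is no longer part of the dataset.
    to_delete = sorted(set(local_manifest) - set(remote_manifest))
    return to_fetch, to_delete


# Made-up file contents, for illustration only.
local = build_manifest({"a.csv": b"1", "b.csv": b"2", "old.csv": b"x"})
remote = build_manifest({"a.csv": b"1", "b.csv": b"2 changed", "c.csv": b"3"})
to_fetch, to_delete = plan_download(local, remote)
print(to_fetch)   # ['b.csv', 'c.csv']
print(to_delete)  # ['old.csv']
```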


Interesting tools

Repositories

• Figshare, Zenodo, Dataverse, DataONE, Dryad

Data access

• Repository APIs, rOpenSci, SPARQL

Data formats

• RDF, OWL, Research object bundles, BagIt, Frictionless data

Differencing data

• Daff (tables), data-diff (JSON), data-diff (Python)

Provenance / processing record

• Workflow platforms (e.g. Galaxy), execution capture tools (e.g. Sumatra)
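
To make the "Differencing data" entry concrete, here is a minimal sketch of a row-level table diff using only the Python standard library. It is a stand-in written for illustration and does not use or reproduce Daff or any other tool listed above. The helper `diff_rows`, the `id` key column and the CSV contents are hypothetical.

```python
import csv
import io


def diff_rows(old_rows, new_rows, key):
    """Compare two lists of dict rows, matched on a key column."""
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}
    added = [new_by_key[k] for k in new_by_key if k not in old_by_key]
    removed = [old_by_key[k] for k in old_by_key if k not in new_by_key]
    changed = []
    for k, old in old_by_key.items():
        new = new_by_key.get(k)
        if new is not None and old != new:
            # List the columns whose values differ between the two versions.
            cols = sorted(
                c for c in set(old) | set(new) if old.get(c) != new.get(c)
            )
            changed.append((k, cols))
    return added, removed, changed


# Made-up table versions, for illustration only.
old_csv = "id,value\n1,a\n2,b\n3,c\n"
new_csv = "id,value\n1,a\n2,B\n4,d\n"
old_rows = list(csv.DictReader(io.StringIO(old_csv)))
new_rows = list(csv.DictReader(io.StringIO(new_csv)))

added, removed, changed = diff_rows(old_rows, new_rows, key="id")
print(added)
print(removed)
print(changed)  # [('2', ['value'])]
```

Matching rows on a key rather than on line position keeps the diff stable when rows are reordered, which a line-based text diff would report as a change.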


turing.ac.uk | @turinginst


[email protected] | @martinoreilly