Data as code: Data management for reproducible research
08/09/2017
Martin O’Reilly
Principal Research Software Engineer
The Alan Turing Institute
@martinoreilly | @turinginst
The Alan Turing Institute is the national centre for data science, headquartered at the British Library.
Turing Research Engineering
• Radka Jersakova
• May Yong
• Tim Hobson
• James Geddes
• James Hetherington
Turing Research Fellows
• Kirstie Whitaker
• Tomas Petricek
Data management for reproducible research
FAIR Data Principles
Source: FORCE11 website. https://www.force11.org/group/fairgroup/fairprinciples. Accessed on 07 Sep 2017
• Findable
• Accessible
• Interoperable
• Re-usable
Code management for reproducible research
• How do I get your code?
  • Online repositories and persistent archives with versioning support
• How do I use your code?
  • Documentation, examples, packages, virtual machines, containers
• How do I trust your code?
  • Tests, examples, readable code
• How do I build on your code?
  • Documentation, readable code, tests
• What am I allowed to do with your code?
  • Licence
Data management for reproducible research
• How do I get your data?
  • Online repositories with versioning and APIs for data access
• How do I use your data?
  • Documentation, metadata, common data formats, data packages
• How do I trust your data?
  • Record of provenance and processing, versioning
• How do I build on your data?
  • Record of provenance and processing, compatible content, linkable to other data
• What am I allowed to do with your data?
  • Licences, terms of use, data access agreements, ethics
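As an illustration of a "record of provenance and processing", the sketch below stores checksums of a raw and a processed dataset alongside a description of the processing step and a licence. This is only one possible shape for such a record: the function names, field names and the sample data are hypothetical and do not follow any particular metadata standard.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(raw: bytes, processed: bytes, step: str, licence: str) -> dict:
    """Describe one processing step so a reader can verify its input and output.

    The field names are hypothetical, not a published metadata standard.
    """
    return {
        "step": step,
        "input_sha256": sha256_of_bytes(raw),
        "output_sha256": sha256_of_bytes(processed),
        "licence": licence,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Made-up sample data: the processing step drops the row with a missing value.
raw = b"country,value\nGB,1\nFR,\n"
processed = b"country,value\nGB,1\n"

record = provenance_record(raw, processed, "drop rows with missing value", "CC-BY-4.0")
print(json.dumps(record, indent=2))
```

Anyone holding the same raw file can recompute the input checksum, rerun the processing code and compare the output checksum with the recorded one.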
Good examples
UN Comtrade database
Web API for programmatic access
Can apply current and historical classification codes to entire dataset
Can select subset of data to retrieve along multiple dimensions
Source: Screenshot of UN Comtrade database website. https://comtrade.un.org/data. Accessed on 06 Sep 2017
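Selecting a subset of data along multiple dimensions through a web API usually comes down to encoding each dimension as a query parameter. The sketch below shows that general pattern only. The endpoint and the parameter names are hypothetical placeholders, not the actual UN Comtrade API, whose documentation should be consulted for the real names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint, used only to illustrate the pattern.
BASE_URL = "https://example.org/api/get"


def build_query_url(reporter: str, partner: str, year: int, classification: str) -> str:
    """Build a URL that selects a subset of a dataset along several dimensions.

    All parameter names are assumptions made for this example.
    """
    params = {
        "reporter": reporter,              # country reporting the trade
        "partner": partner,                # trading partner
        "year": year,                      # reference period
        "classification": classification,  # classification scheme to apply
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = build_query_url("GB", "FR", 2016, "HS")
print(url)

# The selection can be recovered from the URL, so the exact query can be
# recorded alongside an analysis and rerun later.
print(parse_qs(urlsplit(url).query))
```

Because the whole selection lives in the URL, the query itself can be kept under version control with the analysis code.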
UN Comtrade database
Third-party R package available for querying web API
Source: Screenshot from comtradr R package GitHub README.md. https://github.com/ChrisMuir/comtradr. Accessed on 06 Sep 2017
ConnectomeDB
Source: Screenshot of ConnectomeDB login page. https://db.humanconnectome.org. Accessed on 06 Sep 2017
Website requires registration and login
ConnectomeDB
One-time click for acceptance of terms
Generate dedicated Amazon AWS access credentials
Source: Screenshot of ConnectomeDB main page. https://db.humanconnectome.org. Accessed on 06 Sep 2017
The Gamma
Dot-driven development
• IntelliSense autocomplete for data exploration
• Interactive dynamic data preview
• Uses F# type providers
• For more details, see http://tomasp.net/academic/papers/pivot/
Source: The Gamma homepage. https://thegamma.net/. Accessed on 06 Sep 2017
The Gamma
Source: UK National Statistics Public Expenditure Statistical Analyses 2016. Chapter 5 table 5.2. https://www.gov.uk/government/statistics/public-expenditure-statistical-analyses-2016/. Accessed on 06 Sep 2017
Subtotals indicated by background colour
Sub-sub categories indicated by text formatting
Sub categories indicated by initial numerals
The Gamma
Source: Gamma @ The Turing: Accounting for Democracy. http://gamma.turing.ac.uk/expenditure/. Accessed on 06 Sep 2017
Dream data
My wish list
• Repository supporting versioning and content-aware sub-setting
• Data includes raw and processed data, with code to replicate processing
• Content-aware, on-demand differential download
• Automatable access to data requiring an access agreement / authentication
• Data accessible as native code objects
• Documentation accessible in context of data presentation
• Standard, machine-readable licences
• Repository tracks download / usage stats
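To make "content-aware, on-demand differential download" concrete, the sketch below computes a row-level difference between two versions of a small table and applies it to the older version. It is a minimal illustration assuming each row has a unique key. The function names and sample data are hypothetical, and real tools such as Daff handle many more cases (column changes, row ordering, merges).

```python
def diff_tables(old: dict, new: dict) -> dict:
    """Compare two table versions, each a mapping of row key to row values."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {
        key: new[key]
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return {"added": added, "removed": removed, "changed": changed}


def apply_diff(old: dict, diff: dict) -> dict:
    """Rebuild the new version from the old version plus the diff."""
    result = {key: row for key, row in old.items() if key not in diff["removed"]}
    result.update(diff["changed"])
    result.update(diff["added"])
    return result


# Made-up sample data: two versions of a table keyed by country code.
version_1 = {
    "GB": {"value": 1},
    "FR": {"value": 2},
    "DE": {"value": 3},
}
version_2 = {
    "GB": {"value": 1},
    "FR": {"value": 20},
    "ES": {"value": 4},
}

diff = diff_tables(version_1, version_2)
print(diff)

# A client holding version 1 only needs to download the diff, not the full table.
assert apply_diff(version_1, diff) == version_2
```

The diff carries only the rows that differ, which is what would make an on-demand update cheaper than downloading the whole dataset again.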
Interesting tools
Repositories
• Figshare, Zenodo, Dataverse, DataONE, Dryad
Data access
• Repository APIs, rOpenSci, SPARQL
Data formats
• RDF, OWL, Research object bundles, BagIt, Frictionless data
Differencing data
• Daff (tables), data-diff (JSON), data-diff (Python)
Provenance / processing record
• Workflow platforms (e.g. Galaxy), execution capture tools (e.g. Sumatra)
turing.ac.uk | @turinginst
[email protected] | @martinoreilly