JISC and the Big (Research) Data Challenge
Simon Hodson, JISC Programme Manager, Managing Research Data
Thursday 10 May 2012
Eduserv Symposium: Big Data
Why is managing research data important?
JISC considers it a priority to support universities in improving the way research data is managed and, where appropriate, made available for reuse.
• Research funder policies, legislative frameworks, good practice, open data agenda
– The outputs of publicly funded research should be publicly available.
– The evidence underpinning research findings should be available for validation.
• Good data management is good for research
– More efficient research process, avoidance of data loss, benefits of data reuse
• Alignment with university missions
– Universities want to provide excellent research infrastructure.
– Universities want to have better oversight of research outputs.
Estimated Research Data Requirements
Two Russell Group Universities
• Estimated current data holdings of c.2PB (managed and unmanaged)
• Currently provide 800TB/300TB in a central storage facility, not all of which is used (but will be full in 12-18 months)…
• Significant amount of data in temporary storage, external drives etc…
• ‘the more groups we go to talk to, the more we're hearing of significant data holdings on external hard drives and small RAID systems’
1994 Group University
• No central research data provision.
• Faculties (medicine, business, humanities) have 20-30TB each.
• Engineering currently has a 170TB faculty system, with an urgent need to expand.
• But… one group, recently interviewed, currently has 250TB, only half of it in ‘managed storage’; it will reach PB levels in the next few years.
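To put the "PB levels in the next few years" estimate into numbers, here is a minimal sketch of the growth arithmetic. The function names are illustrative, 1PB is taken as 1000TB, and the time horizons are assumptions, since the slide does not state a growth rate.

```python
import math


def implied_annual_growth(current_tb, target_tb, years):
    """Compound annual growth rate needed to go from current_tb to target_tb."""
    return (target_tb / current_tb) ** (1 / years) - 1


def years_to_reach(current_tb, target_tb, annual_growth):
    """Years of compound growth needed to reach target_tb from current_tb."""
    return math.log(target_tb / current_tb) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # 250TB today, 1PB (1000TB) as the target; horizons are assumed values.
    for horizon in (2, 3, 5):
        rate = implied_annual_growth(250, 1000, horizon)
        print(f"{horizon} years: {rate:.0%} growth per year")
```

Under these assumptions, reaching 1PB in two years means doubling every year, and reaching it in three years means growing by roughly 59% a year.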
DUDs
The data centre under the desk (or in a backpack) is not adequate.
Why manage research data?
• Not just about storage or avoiding data loss…!
• It’s about knowing what to keep and what to throw away…
• Important to extract maximum return on investment from publicly funded research.
• Access to underlying data is essential for verification and therefore research integrity.
• Opportunities to extract more knowledge from existing data, new analysis.
• It’s about making the most out of data created!
Making Data Meaningful and Reusable
JISC and Research Data
1. Understanding the problem (pre-2007-2009)
2. Prototyping solutions (2009-11)
3. Hardening solutions and building institutional capacity (2011-13)
4. Developing elements of national infrastructure (2013+)
1: Understanding the Problem
Key JISC reports:
• Dealing with Data: http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/reports/dealing_with_data_report-final.pdf
• Keeping Research Data Safe: http://www.jisc.ac.uk/media/documents/publications/keepingresearchdatasafe0408.pdf
• Skills, Role, Career Structure of Data Scientists and Curators: http://www.jisc.ac.uk/media/documents/programmes/digitalrepositories/dataskillscareersfinalreport.pdf
Other:
• UKRDS Scoping Study: http://www.ukrds.ac.uk/resources/
Prototyping Solutions: First MRD Programme, 2009-11
• First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11
• JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs
RDM Infrastructure (guidance/support, systems)
RDM Planning (DMPs, best practice, disciplinary challenges)
RDM Training (targeted at disciplinary needs)
Challenges of data citation and publication
Building Institutional Capacity: Second MRD Programme, 2011-13
• Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11
• Projects shortly to be announced for research data publication and developing RDM training materials: http://bit.ly/jiscmrd-2012-Call
RDM Infrastructure (policy, guidance/support, systems): 17 large projects
RDM Planning (DMPs, best practice, disciplinary challenges)
RDM Training (disciplines and libraries/research support)
Innovative data publication
A holistic approach…
Leadership and Policy Development
Guidance and Training
Support for Data Management Planning
RDM Systems and Infrastructure
Publication, Citation and Discovery Mechanisms
How to develop RDM services
In development!
Why develop services?
Roles and responsibilities
Process of service development
The components / building blocks
• Policy
• Data Management Planning
• Storage
• Data registry…
Getting started
Examples and case studies to develop into toolkit
Slide Credit: Sarah Jones and Martin Donnelly, DCC
Next steps? Elements of a national infrastructure
• Journals are increasingly implementing policies requiring availability of underlying data.
• Registry of Journal Data Policies to help researchers and research administrators understand the implications and changing landscape.
• Universities are developing catalogues of research data holdings.
• National registry of research data to facilitate discovery, reuse; better understanding of impact and research landscape.
Thank You!
• First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11
• JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs
• Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11
• Programme Blog: http://researchdata.jiscinvolve.org/
• MRD Project Blogs: http://tiny.cc/MRDblogs
• Twitter: #jiscmrd
• E-mail: [email protected]
• Acknowledgements for slides, content: Carol Goble, Liz Lyon, Peter Murray-Rust, David Shotton, Martin Donnelly, Sarah Jones.
From prototype to platform…
DataFlow Project: http://www.dataflow.ox.ac.uk/
UMF Programme SaaS for RDM Projects: http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx
The JISC UMF DataFlow Project
[Diagram: DataStage file system (Researchers), SWORD deposit, DataBank repository (Researchers, other users)]
• DataBank is a generic repository, and can be used to store things other than research datasets, for example data management plans (DMPs)
• DataStage is a file management system
• A DataStage data package consists of selected data files accompanied by an RDF metadata manifest, with a SWORD v2 wrapper
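The package structure described above (selected files plus an RDF manifest) can be sketched roughly as follows. This is an illustration only, not DataStage's actual code or package layout: the function names, the `manifest.rdf` filename and the choice of Dublin Core terms are assumptions, and the SWORD v2 deposit step itself is not shown.

```python
import io
import zipfile
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"


def build_manifest(package_id, title, file_names):
    """Return an RDF/XML manifest listing the files in the package."""
    parts = "\n".join(
        f"    <dcterms:hasPart rdf:resource={quoteattr(name)}/>"
        for name in file_names
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dcterms="{DCTERMS_NS}">\n'
        f"  <rdf:Description rdf:about={quoteattr(package_id)}>\n"
        f"    <dcterms:title>{escape(title)}</dcterms:title>\n"
        f"{parts}\n"
        "  </rdf:Description>\n"
        "</rdf:RDF>\n"
    )


def build_package(package_id, title, files):
    """Zip the selected files with their manifest and return the zip bytes.

    `files` maps a file name to its content (bytes or str).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
        # "manifest.rdf" is a hypothetical name chosen for this sketch.
        manifest = build_manifest(package_id, title, sorted(files))
        archive.writestr("manifest.rdf", manifest)
    return buffer.getvalue()
```

The resulting zip bytes are the kind of payload that a SWORD v2 client could then deposit into a repository such as DataBank.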