Whither Small Data? Some Thoughts on Managing Research Data February 26, 2013 Anita de Waard VP Research Data Collaborations, Elsevier RDS [email protected]


My talk for the BRDI meeting at Washington DC http://sites.nationalacademies.org/PGA/brdi/PGA_080945


Page 1: Whither Small Data?

Whither Small Data? Some Thoughts on Managing Research Data

February 26, 2013

Anita de Waard

VP Research Data Collaborations, Elsevier [email protected]

Page 2: Whither Small Data?

Why should data be saved?

A. Hold scientists accountable:
– Preserve record of scientific process, provenance
– Enable reproducible research

B. Do better science:
– Use results obtained by others!
– Improve interdisciplinary work

C. Enable long-term access:
– Use for technology transfer; societal/industrial development
– Reward scientists for data creation (credit/attribution)
– Allow public/others insight/use of results

Data Preservation

Data Use

Sustainable Models

Page 3: Whither Small Data?
Page 4: Whither Small Data?

Where The Data Goes Now:

> 50 M papers; 2 M scientists; 2 M papers/year

• Majority of data (90%?) is stored on local hard drives
• Some data (8%?) stored in large, generic data repositories
• A small portion of data (1-2%?) stored in small, topic-focused data repositories

Repository holdings shown in the figure:
– Dryad: 7,631 files
– Dataverse: 0.6 M
– Datacite: 1.5 M
– MiRB: 25 k
– PetDB: 1.5 k
– TAIR: 72.1 k
– PDB: 88.3 k
– SedDB: 0.6 k

Page 5: Whither Small Data?

Key Needs:

(Same figure as on page 4, overlaid with three needs:)

• INCREASE DATA PRESERVATION
• IMPROVE DATA USE
• DEVELOP SUSTAINABLE MODELS

Page 6: Whither Small Data?

A. Data Preservation:

• Issues:
– Currently data is often used by single researchers or small groups: many different, idiosyncratic formats
– Often not in electronic form (maps, images)
– No metadata: when, where, by whom, WHY was this data collected?

• Needs:
– Tools to make data export/storage simple and unavoidable
– Policies that make data sharing mandatory and simple
– Systems that reward data sharing/digitisation
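As an illustration of the "when, where, by whom, why" metadata mentioned above, here is a minimal sketch in Python. The class name `DatasetMetadata`, its fields, and the sample values are all hypothetical and are not taken from the talk or from any existing metadata standard; the sketch only shows how a tool might record these four answers and flag the ones left blank.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical minimal record: when, where, by whom and why data was collected."""

    collected_on: date          # when
    location: str               # where
    collected_by: str           # by whom
    purpose: str                # why
    data_format: str = "unknown"

    def missing_fields(self) -> List[str]:
        """Return the names of text fields that were left empty."""
        return [
            name
            for name, value in asdict(self).items()
            if isinstance(value, str) and not value.strip()
        ]

    def to_json(self) -> str:
        """Serialise the record so it can be stored next to the raw data."""
        record = asdict(self)
        record["collected_on"] = self.collected_on.isoformat()
        return json.dumps(record, sort_keys=True)


# Made-up example values; the empty purpose simulates a forgotten "why".
record = DatasetMetadata(
    collected_on=date(2013, 2, 26),
    location="Example field site",
    collected_by="Example Researcher",
    purpose="",
    data_format="csv",
)
print(record.missing_fields())
print(record.to_json())
```

An export tool could refuse to save a data set while `missing_fields()` is non-empty, which is one possible way to make metadata capture "simple and unavoidable".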

Page 7: Whither Small Data?

B. Data Use:

• Issues:
– In generic data repositories, data cannot be used because of inadequate metadata, lack of quality review, lack of provenance
– It’s expensive to make data useable!
– Domain-specific data stores are not cross-searchable across discipline/national borders

• Needs:
– Standardised metadata systems across systems/repositories and tools to apply them easily
– Integration layers to enable cross-repository queries
– A funding model to enable long-term preservation
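The idea of an integration layer for cross-repository queries can be sketched as follows. This is a toy example under stated assumptions: the two in-memory "repositories", their field names, and the functions `from_repo_a`, `from_repo_b` and `cross_search` are invented for illustration and do not correspond to any real repository API. Each source is paired with a small adapter that maps its own fields onto one shared schema, so a single query can run over all of them.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]

# Two hypothetical repositories describing similar data with different field names.
REPO_A: List[Record] = [
    {"title": "Basalt geochemistry", "creator": "Researcher A"},
    {"title": "Coral growth rates", "creator": "Researcher B"},
]
REPO_B: List[Record] = [
    {"name": "Basalt thin sections", "author": "Researcher C"},
]


def from_repo_a(raw: Record) -> Record:
    """Map repository A's fields onto the shared schema."""
    return {"title": raw["title"], "creator": raw["creator"], "source": "repo_a"}


def from_repo_b(raw: Record) -> Record:
    """Map repository B's fields onto the shared schema."""
    return {"title": raw["name"], "creator": raw["author"], "source": "repo_b"}


Source = Tuple[Iterable[Record], Callable[[Record], Record]]


def cross_search(term: str, sources: List[Source]) -> List[Record]:
    """Return normalised records from every source whose title contains the term."""
    needle = term.lower()
    hits: List[Record] = []
    for records, normalise in sources:
        for raw in records:
            record = normalise(raw)
            if needle in record["title"].lower():
                hits.append(record)
    return hits


SOURCES: List[Source] = [(REPO_A, from_repo_a), (REPO_B, from_repo_b)]
results = cross_search("basalt", SOURCES)
for hit in results:
    print(hit["source"], "|", hit["title"])
```

The adapters are where standardised metadata would pay off: the closer the repositories already are to a common schema, the less per-repository mapping code is needed.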

Page 8: Whither Small Data?

C. Sustainable Models:

• Issues:
– Many successful domain-specific data repositories are running out of funding
– Is adding metadata something you want to keep paying PhD+ scientists to do?
– Unclear who foots the bill: the researcher? The institute? The grant agency? For how long?

• Needs:
– Attribution models for rewarding scientists
– Policies to improve cross-domain and cross-national collaborations
– Funding models to sustain databases long-term

Page 9: Whither Small Data?

Linking papers to research data:

Database            Object Linked           Displayed
Pangaea             Google Maps Location    Map with location
Protein Databank    PDB Protein             3D Protein Visualisation
Genbank             Gene Name               NCBI Gene Viewer
Exoplanets +        Exoplanet name          Rich information on extrasolar planets
Species +           Species name            Rich information on species
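The table above can be read as a lookup from a database to the object that is linked in the paper and what is displayed to the reader. A minimal sketch of that reading follows; the names `LinkRule`, `LINK_RULES` and `describe_link` are hypothetical, and the way each row is split into columns is an assumption based on the flattened slide text, not a description of Elsevier's actual linking system.

```python
from typing import Dict, NamedTuple, Optional


class LinkRule(NamedTuple):
    """What is linked in the paper and what the reader is shown."""

    linked_object: str
    displayed: str


# Rows transcribed from the slide's table; the column boundaries are an assumption.
LINK_RULES: Dict[str, LinkRule] = {
    "Pangaea": LinkRule("Google Maps Location", "Map with location"),
    "Protein Databank": LinkRule("PDB Protein", "3D Protein Visualisation"),
    "Genbank": LinkRule("Gene Name", "NCBI Gene Viewer"),
}


def describe_link(database: str) -> Optional[str]:
    """Return a one-line description of the link for a database, or None if unknown."""
    rule = LINK_RULES.get(database)
    if rule is None:
        return None
    return f"{database}: link on {rule.linked_object}, display {rule.displayed}"


print(describe_link("Genbank"))
```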

Page 10: Whither Small Data?

Towards ‘wrapping papers around data’

1. Store metadata on all materials
2. Track the methods while doing them
3. Write papers that ‘wrap around’ this
4. Don’t ‘send’ your papers – just expose them to the outside world
5. Invite reviews; open data to trusted parties, at trusted time
6. Allow apps/tools to integrate

Diagram labels: Calculate, coordinate…; Compile, comment, compare…; Review, Edit, Revise

Example paper text in diagram: Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-

Page 11: Whither Small Data?

Research Data Services:

A. Increase Data Preservation: Help increase the amount and quality of data preserved and shared

B. Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability

C. Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.

Page 12: Whither Small Data?

Guiding Principles of RDS:

• In principle, all open data stays open and URLs, front end etc. stay where they are (i.e. with repository)
• Collaboration is tailored to data repositories’ unique needs/interests; ‘service-model’ type:
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2/3 people) department; immediate communication; instant deployment of ideas

Page 13: Whither Small Data?

Three pilots:

1. Carnegie Mellon Electrophysiology Lab:
A. Data Input: Develop a suite of tools to enable simple data capturing on a handheld device, add metadata during experiment, store with raw traces and create dashboard for viewing
B. Data Use: Integrate with NIF and eagle-I ontologies, enable access through NIF; combine with other sources

2. ImageVault, with Duke CIVM:
A. Data Input: Get 3D image data into common format, resolution, annotated to allow comparison
B. Data Use: View other image data sets & do image analytics
C. Sustainable Models: Create funding for 3D image sets: free layer for raw data/subscription analytics.

Page 14: Whither Small Data?

3. IEDA Data Rescue Process Study

Data Rescue:
– Identify 3-5 data sets that need to be ‘rescued’
– Work with investigators to identify data sources, formats
– Work with IEDA to define metadata standards, quality checks etc.

Data Rescue Process:
– A group of data wranglers perform ‘electrification’ and annotation
– (Open source) software is developed where needed, to help this process
– We help develop common standards, if needed

Page 15: Whither Small Data?

3. IEDA Data Rescue Process Study

Data Rescue Process Study: Jointly publish a report on a ‘gap analysis’ comparing where we are now vs. where we need to be, including:
– What we did (data imported, processes/standards created/described; software built; user tests, outcomes)
– Effort involved (time, software, equipment, skills, etc.)
– How easy it would be to scale up; what part of the data out there could be done this way
– Recommendations for tools and skills that are needed, if we want to scale up this process

Page 16: Whither Small Data?

Summary:

• Three key issues:
A. Data Preservation
B. Data Use
C. Sustainable Models

• Elsevier’s approach:
– Linking data to papers
– Wrap papers around data
– Explore role in the research data space

• Elsevier RDS:
– Three pilots (CMU, Duke, IEDA) to investigate issues
– We’ll report back in about a year!

Page 17: Whither Small Data?

Questions?

Anita de Waard VP Research Data Collaborations, Elsevier

[email protected]