Data Sharing & the NIH Data Catalog

Contributing to the Big Data to Knowledge Initiative at the NIH

Data Sharing and the NIH Data Catalog

Big Data to Knowledge (BD2K)

Big Data 2 Knowledge

Data Catalog Frameworks and Standards

Policies and Data Sharing

Data Sharing Repositories

All NIH-funded data sharing repositories that accept data submissions from any researcher worldwide, whether or not that researcher is funded by the NIH

Data Sharing Policies

All NIH data sharing policies that help researchers develop a plan to share their research data

NIH Data Catalog

Bringing Data Into the Research Ecosystem

Each dataset will be identified by a Data Unique Identifier (DUID), both in the NIH Data Catalog and in the associated journal

Datasets are described in the catalog using MeSH (with creation of a dataset Publication Type)

Datasets are discoverable

NIH Data Catalog produces citable data publications

Citability + proper credit = incentives to submit and publish data

Datasets are citable

Data citations in the NIH Data Catalog are linked to their associated scientific publications in PubMed/PubMed Central

Datasets are linked to the literature

Analysis of trends, impact of data, effect on NIH research funding

Datasets become information in the research ecosystem

Common Metadata Elements

How do current data repositories describe their data?

NIH Data Sharing Repositories

Identifying Metadata Commonalities

Common Metadata Elements

Authorship

Data Description

Date Information

Building a Taxonomy of Metadata Descriptors

• Authorship
o Attribution
o Authors
o Creator(s)
o Data Authors
o Data Owner
o Data Attribution
o Contributor(s)
o PI Name(s)
o Investigator(s)
o Sequence Authors
o Responsible Party
o Data Provider
o Submitter

• Title information
o Name Title
o Collection Type
o Type of Deposit
o Service Name
o Image File Name
o File Name
o Data Collection Title
o Dataset Title
o Dataset Name and Accession
o Submission Title
o Lab Data Title
o Research Objective

Common Metadata Elements

Mapping Metadata to Existing Standards

Mapping to DataCite

• DataCite Metadata Schema
o Identifier
o Creator
o Title
o Publisher
o PublicationYear
o Subject
o Contributor
o Date
o Resource Type
o RelatedIdentifier
o Rights
o Description
o Size, Format, Version

• Common Metadata Elements
o Data Unique Identifier
o Authorship
o Data Title
o Data Location
o Data Completion/Release Date
o Data Descriptors (controlled vocabulary)
o Data Submitter/Affiliation
o Date Information
o Data File Types
o Related Resources
o Access Data Restrictions
o Data Description (narrative)

Mapping to Dryad

• Dryad Metadata Schema
o dcterms:identifier/Data Package Identifier
o dcterms:creator/Author
o dcterms:title/Data Package Title
o dcterms:relation/Location of related content outside of Dryad
o dcterms:available/Date Available
o dcterms:description
o dcterms:subject/Keyword
o dwc:scientificName
o dcterms:references/Associated Dryad publication record ID

• Common Metadata Elements
o Data Unique Identifier
o Authorship
o Data Title
o Data Location
o Data Completion/Release Date
o Data Descriptors (controlled vocabulary)
o Data Submitter/Affiliation
o Date Information
o Data File Types
o Related Resources
o Access Data Restrictions
o Data Description (narrative)

Mapping to MEDLINE

Common Metadata Elements and Proposed Definitions

Data Unique Identifier: A unique ID string that identifies a dataset within the catalog

Author: Individuals involved in producing or contributing to the data

Affiliation: Affiliation of each author, associated with the appropriate author occurrence

Data Title: Name or title by which the dataset is known

Data Location: The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number

Date: The year, month and day when the data was made available

Data Description (structured narrative): Structured narrative description for efficient indexing

Data Descriptors: Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type)

PMIDs: Identifiers that will link the dataset to its associated article(s)

Availability/Accessibility of Data: Whether the data is available to use and how to access it

Award Number: Grant/award numbers associated with the dataset

Version: The version of the dataset (represented as a unique record)
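To make the proposed elements concrete, the sketch below holds them in a single record type. The class name, field names, and the two consistency checks are assumptions made for illustration and are not part of any NIH specification; the author and affiliation values are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogRecord:
    """Hypothetical container for the proposed common metadata elements."""
    duid: str                    # Data Unique Identifier
    authors: List[str]           # Author
    affiliations: List[str]      # Affiliation, one per author, in author order
    title: str                   # Data Title
    location: str                # Data Location (holding entity and accession number)
    date: str                    # Date the data was made available
    description: str = ""        # Data Description (structured narrative)
    descriptors: Dict[str, str] = field(default_factory=dict)  # Data Descriptors
    pmids: List[str] = field(default_factory=list)             # PMIDs
    availability: str = ""       # Availability/Accessibility of Data
    award_numbers: List[str] = field(default_factory=list)     # Award Number
    version: str = ""            # Version

    def problems(self) -> List[str]:
        """Return a list of consistency problems (assumed rules, not NIH rules)."""
        issues = []
        if not self.duid:
            issues.append("missing Data Unique Identifier")
        if len(self.affiliations) != len(self.authors):
            issues.append("affiliations do not line up with authors")
        return issues


# Placeholder record: two authors but only one affiliation.
record = CatalogRecord(
    duid="DUID00001",
    authors=["Author A", "Author B"],
    affiliations=["Institution A"],
    title="Example dataset title",
    location="dbGaP/pht002543.v2.p1",
    date="2014 Jan",
)
print(record.problems())
```

The affiliation check reflects the definition above, which ties each affiliation to a specific author occurrence; a real catalog would likely need stricter validation.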

Data Citation - ICMJE

Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456.

SI: dbGaP/pht002543.v2.p1

Components of the citation, in order:

• Author
• Data Title
• Data Location
• Date data is submitted and paper is ready to publish
• NIH Data Catalog Volume (Issue)
• Data Unique Identifier
• PMID assigned to NIH Data Catalog record
• Secondary source ID (link to actual dataset)
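As a rough illustration of how the example citation is put together, the sketch below assembles it from its parts. The function name and parameter names are hypothetical, and the punctuation pattern is inferred from the single example citation in this deck rather than taken from a published style rule.

```python
from typing import List


def format_data_citation(
    authors: List[str],
    title: str,
    year_month: str,
    volume: int,
    issue: int,
    duid: str,
    pmid: str,
    secondary_id: str,
    catalog: str = "NIH Data Catalog",
) -> str:
    """Build a two-line, ICMJE-style data citation (pattern assumed from the example)."""
    main = (
        f"{', '.join(authors)}. {title}. {catalog}. "
        f"{year_month};{volume}({issue}):{duid}. PubMed PMID: {pmid}."
    )
    # The secondary source ID points at the actual dataset in its repository.
    return f"{main}\nSI: {secondary_id}"


# Values copied from the example citation in the slides.
citation = format_data_citation(
    authors=["Marazita ML", "Weynat RJ", "Feingold E", "Weeks D", "Crout R", "McNeill D"],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    year_month="2014 Jan",
    volume=1,
    issue=1,
    duid="DUID00001",
    pmid="22123456",
    secondary_id="dbGaP/pht002543.v2.p1",
)
print(citation)
```

Keeping the secondary source ID on its own line mirrors the example, where the catalog record and the link to the underlying dataset are shown separately.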

NIH Data Catalog Issues and Concerns

What are we missing?

How many NIH datasets actually exist?

How many unique NIH datasets are NOT represented in existing data repositories?

Could these datasets be represented as a data publication instead of in a repository?

If the datasets are already housed somewhere, do we need a one-stop shop?

Is an NIH Data Catalog the best solution?

Next Steps

• Find out how many datasets are currently in NIH data sharing repositories
o How many datasets do these repositories process per year?
• How many datasets are unique and NOT housed in a repository?
o Search PubMed and PubMed Central and assign categories
• MeSH
o PT: Electronic Supplementary material
o SH: Statistical and numerical data
o MeSH: Databases, Factual
o Statistical Analysis: exclude datasets that already have a location
• How do we manage these unique datasets?
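The PubMed search mentioned in the next steps could be drafted as a query string along the following lines. This is only a sketch: the helper name is hypothetical, combining the three categories with OR is an assumption, and the term spellings are copied from the slide, so they should be checked against the current MeSH vocabulary before use. The field tags [pt], [sh] and [mh] are PubMed's tags for publication type, MeSH subheading and MeSH term.

```python
from typing import Iterable, Tuple


def build_pubmed_query(terms: Iterable[Tuple[str, str]]) -> str:
    """Join (term, field_tag) pairs into one OR query (OR is an assumed choice)."""
    return " OR ".join(f'"{term}"[{tag}]' for term, tag in terms)


# Term text as written on the slide; the indexed spelling may differ.
query = build_pubmed_query([
    ("Electronic Supplementary material", "pt"),  # publication type
    ("Statistical and numerical data", "sh"),     # MeSH subheading
    ("Databases, Factual", "mh"),                 # MeSH term
])
print(query)
```

The string can be pasted into the PubMed search box; excluding datasets that already have a location would be a separate analysis step on the results.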

Questions? Thank you.
